
Revisiting Experience Replayable Conditions (2402.10374v2)

Published 15 Feb 2024 in cs.LG

Abstract: Experience replay (ER) as used in (deep) reinforcement learning is generally considered applicable only to off-policy algorithms. However, ER has in some cases been applied to on-policy algorithms, suggesting that off-policyness might be a sufficient, but not a necessary, condition for applying ER. This paper reconsiders stricter "experience replayable conditions" (ERC) and proposes a way of modifying existing algorithms to satisfy them. To this end, it is postulated that the instability of policy improvements is the pivotal factor in the ERC. The instability factors are identified, from the viewpoint of metric learning, as i) repulsive forces from negative samples and ii) replays of inappropriate experiences, and corresponding stabilization tricks are derived for each. Numerical simulations confirm that the proposed stabilization tricks make ER applicable to an advantage actor-critic, an on-policy algorithm; moreover, its learning performance becomes comparable to that of a soft actor-critic, a state-of-the-art off-policy algorithm.
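The metric-learning framing can be made concrete with the standard triplet loss (as in Schroff et al., 2015, cited below); this is not the paper's exact objective, but it illustrates what "repulsive forces from negative samples" means:

$$
\mathcal{L}_{\text{triplet}} = \max\!\left(0,\; \|f(x_a) - f(x_p)\|_2^2 - \|f(x_a) - f(x_n)\|_2^2 + m\right),
$$

where $x_a$, $x_p$, and $x_n$ are anchor, positive, and negative samples, $f$ is the learned embedding, and $m$ is a margin. The $-\|f(x_a) - f(x_n)\|_2^2$ term exerts a repulsive force that pushes the negative sample's embedding away from the anchor; the analogous repulsive term in a replayed policy update is one of the instability factors the paper identifies.

As a rough illustration of what "applying ER to an advantage actor-critic" means mechanically, here is a minimal, self-contained sketch in Python/PyTorch. It is an assumption-laden toy, not the paper's method: it assumes a generic Gaussian policy, a one-step TD target, and uniform random replay, and it includes none of the proposed stabilization tricks, so this naive replayed A2C update is precisely the kind of setup the paper characterizes as unstable.

```python
# Hypothetical sketch: naive experience replay bolted onto an A2C-style update.
# NOT the paper's algorithm; illustrative assumptions throughout.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, GAMMA = 4, 2, 0.99  # toy sizes, assumed for illustration

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s', done) transitions."""
    def __init__(self, capacity=10_000):
        self.data = deque(maxlen=capacity)

    def push(self, *transition):
        self.data.append(transition)

    def sample(self, batch_size):
        s, a, r, s2, d = zip(*random.sample(self.data, batch_size))
        return (torch.stack(s), torch.stack(a),
                torch.tensor(r, dtype=torch.float32),
                torch.stack(s2),
                torch.tensor(d, dtype=torch.float32))

    def __len__(self):
        return len(self.data)

# Tiny actor-critic; the actor outputs the mean of a Gaussian policy whose
# state-independent log-std is a free parameter (an assumption of this sketch).
actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, ACTION_DIM))
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
log_std = torch.zeros(ACTION_DIM, requires_grad=True)
opt = torch.optim.Adam([*actor.parameters(), *critic.parameters(), log_std], lr=3e-4)

def replayed_a2c_update(buffer, batch_size=64):
    """One A2C-style update computed on a randomly replayed mini-batch."""
    if len(buffer) < batch_size:
        return
    s, a, r, s2, d = buffer.sample(batch_size)
    with torch.no_grad():                          # one-step TD target
        target = r + GAMMA * (1.0 - d) * critic(s2).squeeze(-1)
    value = critic(s).squeeze(-1)
    advantage = (target - value).detach()          # baseline-subtracted signal
    dist = torch.distributions.Normal(actor(s), log_std.exp())
    log_prob = dist.log_prob(a).sum(-1)
    actor_loss = -(log_prob * advantage).mean()    # replayed policy-gradient term
    critic_loss = F.mse_loss(value, target)        # replayed value regression
    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()

# Smoke test with synthetic transitions standing in for environment steps.
buf = ReplayBuffer()
for _ in range(256):
    s, s2 = torch.randn(STATE_DIM), torch.randn(STATE_DIM)
    a = torch.randn(ACTION_DIM)
    buf.push(s, a, random.random(), s2, float(random.random() < 0.05))
replayed_a2c_update(buf)
```

In the paper's terms, the stabilization tricks would presumably alter how the replayed policy-gradient term above is computed and which stored experiences are eligible for replay, addressing the two instability factors respectively.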

Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Fakoor R, Chaudhari P, Smola AJ (2020) P3O: Policy-on policy-off policy optimization. In: Uncertainty in Artificial Intelligence, PMLR, pp 1017–1027 Fedus et al [2020] Fedus W, Ramachandran P, Agarwal R, et al (2020) Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, PMLR, pp 3061–3071 Fujimoto et al [2018] Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. 
arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Fedus W, Ramachandran P, Agarwal R, et al (2020) Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, PMLR, pp 3061–3071 Fujimoto et al [2018] Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. 
Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. 
In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. 
Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. 
Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. 
Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. 
Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. 
nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. 
In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. 
arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. 
In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Chen J, Li SE, Tomizuka M (2021) Interpretable end-to-end urban autonomous driving with latent deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems 23(6):5068–5078 Cheng et al [2016] Cheng D, Gong Y, Zhou S, et al (2016) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In: IEEE conference on computer vision and pattern recognition, pp 1335–1344 Cui et al [2021] Cui Y, Osaki S, Matsubara T (2021) Autonomous boat driving system using sample-efficient model predictive control-based reinforcement learning approach. Journal of Field Robotics 38(3):331–354 Degris et al [2012] Degris T, White M, Sutton RS (2012) Off-policy actor-critic. In: International Conference on Machine Learning Fakoor et al [2020] Fakoor R, Chaudhari P, Smola AJ (2020) P3O: Policy-on policy-off policy optimization. In: Uncertainty in Artificial Intelligence, PMLR, pp 1017–1027 Fedus et al [2020] Fedus W, Ramachandran P, Agarwal R, et al (2020) Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, PMLR, pp 3061–3071 Fujimoto et al [2018] Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. 
Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. 
nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Cheng D, Gong Y, Zhou S, et al (2016) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In: IEEE conference on computer vision and pattern recognition, pp 1335–1344 Cui et al [2021] Cui Y, Osaki S, Matsubara T (2021) Autonomous boat driving system using sample-efficient model predictive control-based reinforcement learning approach. Journal of Field Robotics 38(3):331–354 Degris et al [2012] Degris T, White M, Sutton RS (2012) Off-policy actor-critic. In: International Conference on Machine Learning Fakoor et al [2020] Fakoor R, Chaudhari P, Smola AJ (2020) P3O: Policy-on policy-off policy optimization. In: Uncertainty in Artificial Intelligence, PMLR, pp 1017–1027 Fedus et al [2020] Fedus W, Ramachandran P, Agarwal R, et al (2020) Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, PMLR, pp 3061–3071 Fujimoto et al [2018] Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. 
Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Cui Y, Osaki S, Matsubara T (2021) Autonomous boat driving system using sample-efficient model predictive control-based reinforcement learning approach. Journal of Field Robotics 38(3):331–354 Degris et al [2012] Degris T, White M, Sutton RS (2012) Off-policy actor-critic. In: International Conference on Machine Learning Fakoor et al [2020] Fakoor R, Chaudhari P, Smola AJ (2020) P3O: Policy-on policy-off policy optimization. In: Uncertainty in Artificial Intelligence, PMLR, pp 1017–1027 Fedus et al [2020] Fedus W, Ramachandran P, Agarwal R, et al (2020) Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, PMLR, pp 3061–3071 Fujimoto et al [2018] Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. 
arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Degris T, White M, Sutton RS (2012) Off-policy actor-critic. In: International Conference on Machine Learning Fakoor et al [2020] Fakoor R, Chaudhari P, Smola AJ (2020) P3O: Policy-on policy-off policy optimization. In: Uncertainty in Artificial Intelligence, PMLR, pp 1017–1027 Fedus et al [2020] Fedus W, Ramachandran P, Agarwal R, et al (2020) Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, PMLR, pp 3061–3071 Fujimoto et al [2018] Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. 
arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. 
IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Fakoor R, Chaudhari P, Smola AJ (2020) P3O: Policy-on policy-off policy optimization. In: Uncertainty in Artificial Intelligence, PMLR, pp 1017–1027 Fedus et al [2020] Fedus W, Ramachandran P, Agarwal R, et al (2020) Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, PMLR, pp 3061–3071 Fujimoto et al [2018] Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. 
Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Fedus W, Ramachandran P, Agarwal R, et al (2020) Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, PMLR, pp 3061–3071 Fujimoto et al [2018] Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. 
Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. 
arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. 
arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  3. Caggiano V, Wang H, Durandau G, et al (2022) Myosuite–a contact-rich simulation suite for musculoskeletal motor control. arXiv preprint arXiv:220513600 Chen et al [2021] Chen J, Li SE, Tomizuka M (2021) Interpretable end-to-end urban autonomous driving with latent deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems 23(6):5068–5078 Cheng et al [2016] Cheng D, Gong Y, Zhou S, et al (2016) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In: IEEE conference on computer vision and pattern recognition, pp 1335–1344 Cui et al [2021] Cui Y, Osaki S, Matsubara T (2021) Autonomous boat driving system using sample-efficient model predictive control-based reinforcement learning approach. Journal of Field Robotics 38(3):331–354 Degris et al [2012] Degris T, White M, Sutton RS (2012) Off-policy actor-critic. In: International Conference on Machine Learning Fakoor et al [2020] Fakoor R, Chaudhari P, Smola AJ (2020) P3O: Policy-on policy-off policy optimization. In: Uncertainty in Artificial Intelligence, PMLR, pp 1017–1027 Fedus et al [2020] Fedus W, Ramachandran P, Agarwal R, et al (2020) Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, PMLR, pp 3061–3071 Fujimoto et al [2018] Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. 
arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Chen J, Li SE, Tomizuka M (2021) Interpretable end-to-end urban autonomous driving with latent deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems 23(6):5068–5078 Cheng et al [2016] Cheng D, Gong Y, Zhou S, et al (2016) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In: IEEE conference on computer vision and pattern recognition, pp 1335–1344 Cui et al [2021] Cui Y, Osaki S, Matsubara T (2021) Autonomous boat driving system using sample-efficient model predictive control-based reinforcement learning approach. Journal of Field Robotics 38(3):331–354 Degris et al [2012] Degris T, White M, Sutton RS (2012) Off-policy actor-critic. 
In: International Conference on Machine Learning Fakoor et al [2020] Fakoor R, Chaudhari P, Smola AJ (2020) P3O: Policy-on policy-off policy optimization. In: Uncertainty in Artificial Intelligence, PMLR, pp 1017–1027 Fedus et al [2020] Fedus W, Ramachandran P, Agarwal R, et al (2020) Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, PMLR, pp 3061–3071 Fujimoto et al [2018] Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. 
arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Cheng D, Gong Y, Zhou S, et al (2016) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In: IEEE conference on computer vision and pattern recognition, pp 1335–1344 Cui et al [2021] Cui Y, Osaki S, Matsubara T (2021) Autonomous boat driving system using sample-efficient model predictive control-based reinforcement learning approach. Journal of Field Robotics 38(3):331–354 Degris et al [2012] Degris T, White M, Sutton RS (2012) Off-policy actor-critic. In: International Conference on Machine Learning Fakoor et al [2020] Fakoor R, Chaudhari P, Smola AJ (2020) P3O: Policy-on policy-off policy optimization. In: Uncertainty in Artificial Intelligence, PMLR, pp 1017–1027 Fedus et al [2020] Fedus W, Ramachandran P, Agarwal R, et al (2020) Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, PMLR, pp 3061–3071 Fujimoto et al [2018] Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. 
Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. 
Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. 
IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. 
nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. 
In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. 
In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. 
Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. 
Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. 
In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. 
arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. 
IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. 
arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. 
arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. 
nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. 
In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. 
In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. 
In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. 
Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. 
In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. 
In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. 
Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. 
Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. 
arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. 
nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. 
In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. 
In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. 
Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. 
Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. 
In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. 
arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. 
arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. 
arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. 
In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. 
In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. 
In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. 
Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. 
Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. 
arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. 
arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. 
arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. 
Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. 
arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. 
In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
  7. Degris T, White M, Sutton RS (2012) Off-policy actor-critic. In: International Conference on Machine Learning Fakoor et al [2020] Fakoor R, Chaudhari P, Smola AJ (2020) P3O: Policy-on policy-off policy optimization. In: Uncertainty in Artificial Intelligence, PMLR, pp 1017–1027 Fedus et al [2020] Fedus W, Ramachandran P, Agarwal R, et al (2020) Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, PMLR, pp 3061–3071 Fujimoto et al [2018] Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. 
arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Fakoor R, Chaudhari P, Smola AJ (2020) P3O: Policy-on policy-off policy optimization. In: Uncertainty in Artificial Intelligence, PMLR, pp 1017–1027 Fedus et al [2020] Fedus W, Ramachandran P, Agarwal R, et al (2020) Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, PMLR, pp 3061–3071 Fujimoto et al [2018] Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. 
Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in Neural Information Processing Systems 32
Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on SARSA. In: IEEE Symposium Series on Computational Intelligence, IEEE, pp 1–6
arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. 
arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. 
arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. 
arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. 
arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  8. Fakoor R, Chaudhari P, Smola AJ (2020) P3O: Policy-on policy-off policy optimization. In: Uncertainty in Artificial Intelligence, PMLR, pp 1017–1027 Fedus et al [2020] Fedus W, Ramachandran P, Agarwal R, et al (2020) Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, PMLR, pp 3061–3071 Fujimoto et al [2018] Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. 
arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Fedus W, Ramachandran P, Agarwal R, et al (2020) Revisiting fundamentals of experience replay. In: International Conference on Machine Learning, PMLR, pp 3061–3071 Fujimoto et al [2018] Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning, PMLR, pp 1587–1596 Ganin et al [2016] Ganin Y, Ustinova E, Ajakan H, et al (2016) Domain-adversarial training of neural networks. Journal of machine learning research 17(59):1–35 Gu et al [2017] Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. 
Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. 
In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. 
arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. 
arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. 
nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. 
In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. 
arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. 
In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. 
arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. 
IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. 
nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. 
In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. 
arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. 
Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. 
arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Gu SS, Lillicrap T, Turner RE, et al (2017) Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. Advances in neural information processing systems 30 Haarnoja et al [2018a] Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. 
Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. 
Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. 
arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. 
In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. 
In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. 
In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. 
In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. 
In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. 
arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. 
arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. 
arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. 
arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. 
arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. 
arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. 
arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. 
Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. 
arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. 
nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. 
In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. 
In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. 
Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. 
Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. 
arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. 
arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  13. Haarnoja T, Zhou A, Abbeel P, et al (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870 Haarnoja et al [2018b] Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. 
In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Haarnoja T, Zhou A, Hartikainen K, et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:181205905 Hambly et al [2023] Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. 
arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. 
In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. 
Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. 
In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations
Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347
Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:2202.12504
Kobayashi T (2022b) L2C2: Locally Lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039
Kobayashi T (2022c) Optimistic reinforcement learning by forward Kullback–Leibler divergence optimization. Neural Networks 152:169–180
Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:2308.12772
Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192
Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:2303.04356
Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18
Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909
Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971
Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8(3-4):293–321
Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, PMLR, pp 1928–1937
Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860
Oh I, Rho S, Moon S, et al (2021) Creating pro-level AI for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220
Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31
Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086
Paszke A, Gross S, Massa F, et al (2019) PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32
Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672
Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952
Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: A unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 815–823
Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347
Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123
Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958
Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by PID Lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143
Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT Press
Todorov E, Erez T, Tassa Y (2012) MuJoCo: A physics engine for model-based control. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 5026–5033
Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022
Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected SARSA. In: IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, IEEE, pp 177–184
Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1386–1393
Wang X, Song J, Qi P, et al (2021) SCC: An efficient deep reinforcement learning agent mastering the game of StarCraft II. In: International Conference on Machine Learning, PMLR, pp 10905–10915
Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations
Wu P, Escontrela A, Hafner D, et al (2023) DayDreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240
Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142
Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87
Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32
Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in Neural Information Processing Systems 32
Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on SARSA. In: IEEE Symposium Series on Computational Intelligence, IEEE, pp 1–6
arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. 
arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. 
arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. 
arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. 
arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. 
arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  15. Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Mathematical Finance 33(3):437–503 Hansen et al [2018] Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Hansen S, Pritzel A, Sprechmann P, et al (2018) Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems 31 Ilboudo et al [2023] Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. 
In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. 
In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. 
arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. 
Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. 
IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. 
  17. Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692 Kalashnikov et al [2018] Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. 
In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. 
In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. 
In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. 
Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. 
Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. 
IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. 
In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  18. Kalashnikov D, Irpan A, Pastor P, et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, PMLR, pp 651–673 Kapturowski et al [2019] Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kapturowski S, Ostrovski G, Quan J, et al (2019) Recurrent experience replay in distributed reinforcement learning. 
In: International Conference on Learning Representations Kobayashi [2019] Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. 
In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347 Kobayashi [2022a] Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. 
In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:220212504 Kobayashi [2022b] Kobayashi T (2022b) L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039 Kobayashi [2022c] Kobayashi T (2022c) Optimistic reinforcement learning by forward kullback–leibler divergence optimization. Neural Networks 152:169–180 Kobayashi [2023a] Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:230812772 Kobayashi [2023b] Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. 
In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations
Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49(12):4335–4347
Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:2202.12504
Kobayashi T (2022b) L2C2: Locally Lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039
Kobayashi T (2022c) Optimistic reinforcement learning by forward Kullback–Leibler divergence optimization. Neural Networks 152:169–180
Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:2308.12772
Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192
Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:2303.04356
Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18
Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909
Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971
Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8(3-4):293–321
Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, PMLR, pp 1928–1937
Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860
Oh I, Rho S, Moon S, et al (2021) Creating pro-level AI for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220
Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31
Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086
Paszke A, Gross S, Massa F, et al (2019) PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32
Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672
Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952
Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: A unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 815–823
Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347
Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123
Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958
Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by PID Lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143
Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT Press
Todorov E, Erez T, Tassa Y (2012) MuJoCo: A physics engine for model-based control. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 5026–5033
Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022
Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected Sarsa. In: IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, IEEE, pp 177–184
Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1386–1393
Wang X, Song J, Qi P, et al (2021) SCC: An efficient deep reinforcement learning agent mastering the game of StarCraft II. In: International Conference on Machine Learning, PMLR, pp 10905–10915
Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations
Wu P, Escontrela A, Hafner D, et al (2023) DayDreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240
Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142
Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87
Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32
Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in Neural Information Processing Systems 32
Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on Sarsa. In: IEEE Symposium Series on Computational Intelligence, IEEE, pp 1–6
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. 
In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. 
In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. 
arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. 
In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
Kobayashi T (2022a) Consolidated adaptive t-soft update for deep reinforcement learning. arXiv preprint arXiv:2202.12504
Kobayashi T (2022b) L2C2: Locally Lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 4032–4039
Kobayashi T (2022c) Optimistic reinforcement learning by forward Kullback–Leibler divergence optimization. Neural Networks 152:169–180
Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:2308.12772
Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192
Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:2303.04356
Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18
Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909
Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971
Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8(3–4):293–321
Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, PMLR, pp 1928–1937
Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860
Oh I, Rho S, Moon S, et al (2021) Creating pro-level AI for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220
Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31
Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086
Paszke A, Gross S, Massa F, et al (2019) PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32
Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672
Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952
Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: A unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 815–823
Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347
Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123
Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958
Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by PID Lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction. MIT Press
Todorov E, Erez T, Tassa Y (2012) MuJoCo: A physics engine for model-based control. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 5026–5033
Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022
Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected SARSA. In: IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, IEEE, pp 177–184
Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1386–1393
Wang X, Song J, Qi P, et al (2021) SCC: An efficient deep reinforcement learning agent mastering the game of StarCraft II. In: International Conference on Machine Learning, PMLR, pp 10905–10915
Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations
Wu P, Escontrela A, Hafner D, et al (2023) DayDreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240
Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142
Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87
Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32
Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in Neural Information Processing Systems 32
Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on SARSA. In: IEEE Symposium Series on Computational Intelligence, IEEE, pp 1–6
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. 
nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. 
In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. 
In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022
Kobayashi T (2022c) Optimistic reinforcement learning by forward Kullback–Leibler divergence optimization. Neural Networks 152:169–180
Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:2308.12772
Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192
Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:2303.04356
Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18
Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909
Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971
Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8(3-4):293–321
Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, PMLR, pp 1928–1937
Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860
Oh I, Rho S, Moon S, et al (2021) Creating pro-level AI for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220
Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31
Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086
Paszke A, Gross S, Massa F, et al (2019) PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32
Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672
Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952
Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: A unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 815–823
Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347
Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123
Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958
Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by PID Lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143
Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT Press
Todorov E, Erez T, Tassa Y (2012) MuJoCo: A physics engine for model-based control. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 5026–5033
Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022
Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected Sarsa. In: IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, IEEE, pp 177–184
Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1386–1393
Wang X, Song J, Qi P, et al (2021) SCC: An efficient deep reinforcement learning agent mastering the game of StarCraft II. In: International Conference on Machine Learning, PMLR, pp 10905–10915
Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations
Wu P, Escontrela A, Hafner D, et al (2023) DayDreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240
Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142
Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87
Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32
Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in Neural Information Processing Systems 32
Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on SARSA. In: IEEE Symposium Series on Computational Intelligence, IEEE, pp 1–6
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. 
In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. 
arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  25. Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results in Control and Optimization 10:100192 Kobayashi [2023c] Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:230304356 Kobayashi and Aotani [2023] Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. 
IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. 
nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in Neural Information Processing Systems 32
Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:2303.04356
Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18
Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909
Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971
Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8(3-4):293–321
Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, PMLR, pp 1928–1937
Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860
Oh I, Rho S, Moon S, et al (2021) Creating pro-level AI for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220
Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31
Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086
Paszke A, Gross S, Massa F, et al (2019) PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32
Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672
Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952
Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: A unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 815–823
Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347
Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123
Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958
Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by PID Lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction. MIT Press
Todorov E, Erez T, Tassa Y (2012) MuJoCo: A physics engine for model-based control. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp 5026–5033
Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022
Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected Sarsa. In: IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, IEEE, pp 177–184
Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1386–1393
Wang X, Song J, Qi P, et al (2021) SCC: An efficient deep reinforcement learning agent mastering the game of StarCraft II. In: International Conference on Machine Learning, PMLR, pp 10905–10915
Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations
Wu P, Escontrela A, Hafner D, et al (2023) DayDreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240
Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142
Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87
Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32
Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in Neural Information Processing Systems 32
Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on Sarsa. In: IEEE Symposium Series on Computational Intelligence, IEEE, pp 1–6
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. 
In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  27. Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Advanced Robotics pp 1–18 Levine [2018] Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Levine S (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:180500909 Lillicrap et al [2015] Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lillicrap TP, Hunt JJ, Pritzel A, et al (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:150902971 Lin [1992] Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. 
In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. 
In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. 
In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  30. Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8(3-4):293–321 Mnih et al [2015] Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. 
In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. 
In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. 
  31. Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. nature 518(7540):529–533 Mnih et al [2016] Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Mnih V, Badia AP, Mirza M, et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, PMLR, pp 1928–1937 Novati and Koumoutsakos [2019] Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. 
In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. 
arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. 
Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. 
In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  33. Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International Conference on Machine Learning, PMLR, pp 4851–4860 Oh et al [2021] Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Oh I, Rho S, Moon S, et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Transactions on Games 14(2):212–220 Osband et al [2018] Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. 
In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems 31 Parmas and Sugiyama [2021] Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. 
MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086 Paszke et al [2019] Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123 Srivastava et al [2014] Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. 
The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 Saglam et al [2023] Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672 Schaul et al [2015] Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952 Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Schulman et al [2017] Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347 Sinha et al [2022] Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958 Stooke et al [2020] Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143 Sutton and Barto [2018] Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. 
Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. 
In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. 
In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  36. Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 4078–4086
  37. Paszke A, Gross S, Massa F, et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32
  38. Saglam B, Mutlu FB, Cicek DC, et al (2023) Actor prioritized experience replay. Journal of Artificial Intelligence Research 78:639–672
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  39. Schaul T, Quan J, Antonoglou I, et al (2015) Prioritized experience replay. arXiv preprint arXiv:151105952
  40. Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa.
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT press Todorov et al [2012] Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. 
In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. 
In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. 
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  41. Schulman J, Wolski F, Dhariwal P, et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:170706347
  42. Sinha S, Song J, Garg A, et al (2022) Experience replay with likelihood-free importance weights. In: Learning for Dynamics and Control Conference, PMLR, pp 110–123
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. 
In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  43. Srivastava N, Hinton G, Krizhevsky A, et al (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958
  44. Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by PID Lagrangian methods. In: International Conference on Machine Learning, PMLR, pp 9133–9143
  45. Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT Press
Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  46. Todorov E, Erez T, Tassa Y (2012) Mujoco: A physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems, IEEE, pp 5026–5033 Tunyasuvunakool et al [2020] Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  47. Tunyasuvunakool S, Muldal A, Doron Y, et al (2020) dm_control: Software and tasks for continuous control. Software Impacts 6:100022 Van Seijen et al [2009] Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. 
In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. 
In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  48. Van Seijen H, Van Hasselt H, Whiteson S, et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp 177–184 Wang et al [2014] Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. 
In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  49. Wang J, Song Y, Leung T, et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393 Wang et al [2021] Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. 
In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. 
Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  50. Wang X, Song J, Qi P, et al (2021) Scc: An efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning, PMLR, pp 10905–10915 Wang et al [2017] Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations Wu et al [2023] Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240 Xuan et al [2020] Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. 
In: European Conference on Computer Vision, pp 126–142 Yu et al [2018] Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87 Zhang and Sennrich [2019] Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32 Zhang et al [2019] Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in neural information processing systems 32 Zhao et al [2016] Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6 Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence, IEEE, pp 1–6
  51. Wang Z, Bapst V, Heess N, et al (2017) Sample efficient actor-critic with experience replay. In: International Conference on Learning Representations
  52. Wu P, Escontrela A, Hafner D, et al (2023) Daydreamer: World models for physical robot learning. In: Conference on Robot Learning, PMLR, pp 2226–2240
  53. Xuan H, Stylianou A, Liu X, et al (2020) Hard negative examples are hard, but useful. In: European Conference on Computer Vision, pp 126–142
  54. Yu B, Liu T, Gong M, et al (2018) Correcting the triplet selection bias for triplet loss. In: European Conference on Computer Vision, pp 71–87
  55. Zhang B, Sennrich R (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32
  56. Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Advances in Neural Information Processing Systems 32
  57. Zhao D, Wang H, Shao K, et al (2016) Deep reinforcement learning with experience replay based on SARSA. In: IEEE Symposium Series on Computational Intelligence, IEEE, pp 1–6
Authors (1)
  1. Taisuke Kobayashi (36 papers)
Citations (1)


