ACE : Off-Policy Actor-Critic with Causality-Aware Entropy Regularization (2402.14528v5)
Abstract: The varying significance of distinct primitive behaviors during the policy learning process has been overlooked by prior model-free RL algorithms. Leveraging this insight, we explore the causal relationship between different action dimensions and rewards to evaluate the significance of various primitive behaviors during training. We introduce a causality-aware entropy term that effectively identifies and prioritizes actions with high potential impact for efficient exploration. Furthermore, to prevent excessive focus on specific primitive behaviors, we analyze the gradient dormancy phenomenon and introduce a dormancy-guided reset mechanism to further enhance the efficacy of our method. Our proposed algorithm, ACE: Off-policy Actor-critic with Causality-aware Entropy regularization, demonstrates a substantial performance advantage over model-free RL baselines across 29 diverse continuous control tasks spanning 7 domains, underscoring the effectiveness, versatility, and sample efficiency of our approach. Benchmark results and videos are available at https://ace-rl.github.io/.
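To make the two ideas in the abstract concrete, here is a minimal Python/PyTorch sketch, not the authors' implementation: a causality-aware entropy term that weights each action dimension's entropy by an externally estimated causal influence on the reward, and a rough gradient-dormancy check paired with a soft parameter reset. The source of `causal_weights`, the dormancy threshold `tau`, the reset scale, and the reset strength `alpha` are all assumptions for illustration.

```python
# Sketch of (1) causality-weighted entropy for a diagonal Gaussian policy and
# (2) a gradient-dormancy ratio with a soft reset. Not the ACE reference code.
import torch
import torch.nn as nn


def causality_aware_entropy(dist: torch.distributions.Normal,
                            causal_weights: torch.Tensor) -> torch.Tensor:
    """Weighted sum of per-dimension Gaussian entropies.

    dist:           diagonal Gaussian policy, event shape (batch, action_dim)
    causal_weights: non-negative scores of each action dimension's causal
                    influence on the reward (assumed given; normalized here).
    """
    w = causal_weights / causal_weights.sum()          # normalize to sum to 1
    per_dim_entropy = dist.entropy()                   # shape (batch, action_dim)
    return (per_dim_entropy * w).sum(dim=-1).mean()    # scalar regularizer


def gradient_dormancy_ratio(model: nn.Module, tau: float = 1e-3) -> float:
    """Fraction of parameters whose gradient magnitude falls below `tau`
    times the mean gradient magnitude (a rough dormancy proxy)."""
    grads = [p.grad.abs().flatten() for p in model.parameters() if p.grad is not None]
    if not grads:
        return 0.0
    g = torch.cat(grads)
    return (g < tau * g.mean()).float().mean().item()


def soft_reset(model: nn.Module, alpha: float) -> None:
    """Interpolate current weights toward a fresh initialization; a higher
    dormancy ratio would motivate a larger `alpha`."""
    with torch.no_grad():
        for p in model.parameters():
            fresh = torch.empty_like(p).normal_(0.0, 0.02)  # assumed init scale
            p.mul_(1.0 - alpha).add_(alpha * fresh)
```

In an SAC-style update, the weighted entropy above would replace the uniform entropy bonus in the actor objective, and the dormancy ratio would be checked periodically to decide whether (and how strongly) to soft-reset the networks; these wiring details are likewise assumptions rather than a statement of the paper's exact procedure.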
Authors: Tianying Ji, Yongyuan Liang, Yan Zeng, Yu Luo, Guowei Xu, Jiawei Guo, Ruijie Zheng, Furong Huang, Fuchun Sun, Huazhe Xu