
EfficientZero V2: Mastering Discrete and Continuous Control with Limited Data (2403.00564v2)

Published 1 Mar 2024 in cs.LG, cs.AI, and cs.RO

Abstract: Sample efficiency remains a crucial challenge in applying Reinforcement Learning (RL) to real-world tasks. While recent algorithms have made significant strides in improving sample efficiency, none have achieved consistently superior performance across diverse domains. In this paper, we introduce EfficientZero V2, a general framework designed for sample-efficient RL algorithms. We have expanded the performance of EfficientZero to multiple domains, encompassing both continuous and discrete actions, as well as visual and low-dimensional inputs. With a series of improvements we propose, EfficientZero V2 outperforms the current state-of-the-art (SOTA) by a significant margin in diverse tasks under the limited data setting. EfficientZero V2 exhibits a notable advancement over the prevailing general algorithm, DreamerV3, achieving superior outcomes in 50 of 66 evaluated tasks across diverse benchmarks, such as Atari 100k, Proprio Control, and Vision Control.

Authors (5)
  1. Shengjie Wang (29 papers)
  2. Shaohuai Liu (5 papers)
  3. Weirui Ye (9 papers)
  4. Jiacheng You (12 papers)
  5. Yang Gao (762 papers)

Summary

  • The paper introduces EfficientZero V2, a novel RL algorithm that integrates sampling-based Gumbel search and search-based value estimation to drastically reduce simulation needs.
  • It demonstrates superior performance over previous methods on benchmarks like Atari 100k, Proprio Control, and Vision Control with significant score improvements.
  • The method’s robust architecture efficiently handles both discrete and continuous control tasks, paving the way for practical applications in robotics and autonomous systems.

EfficientZero V2: Mastering Discrete and Continuous Control with Limited Data

Introduction

EfficientZero V2 (EZ-V2) presents a substantial advancement in sample-efficient Reinforcement Learning (RL). While traditional RL methods perform well when data is abundant, they translate poorly to practical applications because of their massive data requirements. By leveraging a series of new techniques, EZ-V2 achieves superior performance across a diverse set of domains, including discrete and continuous control and tasks with varying observation complexities. Specifically, EZ-V2 surpasses the previous state of the art (SOTA) by a significant margin on benchmarks such as Atari 100k, Proprio Control, and Vision Control, while operating under a strict budget of environment interactions.

Key Contributions

General Framework for Sample Efficient RL

The EZ-V2 framework integrates several core components to achieve high sample efficiency for both discrete and continuous action spaces, and for both visual and low-dimensional inputs. Unlike EfficientZero, EZ-V2 adopts a Gumbel search for policy improvement, enabling effective planning with far fewer simulations. The same framework is applied across all evaluated domains and yields consistent performance gains; a structural sketch of its building blocks follows.
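As a rough orientation, the sketch below shows the MuZero/EfficientZero-style building blocks that methods in this family share: a representation network, a latent dynamics model with a reward head, and policy/value heads queried during search. The class, layer sizes, and method names are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Minimal sketch of a MuZero/EfficientZero-style model (names and sizes assumed)."""
    def __init__(self, obs_dim: int, action_dim: int, latent_dim: int = 128):
        super().__init__()
        # h: encode a raw observation (visual or low-dimensional) into a latent state
        self.representation = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.ReLU())
        # g: predict the next latent state and reward from (latent state, action)
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + action_dim, latent_dim), nn.ReLU())
        self.reward_head = nn.Linear(latent_dim, 1)
        # f: policy and value heads used as priors / leaf estimates during search
        self.policy_head = nn.Linear(latent_dim, action_dim)
        self.value_head = nn.Linear(latent_dim, 1)

    def initial_inference(self, obs):
        s = self.representation(obs)
        return s, self.policy_head(s), self.value_head(s)

    def recurrent_inference(self, s, a):
        s_next = self.dynamics(torch.cat([s, a], dim=-1))
        return s_next, self.reward_head(s_next), self.policy_head(s_next), self.value_head(s_next)
```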

Enhanced Planning via Sampling-Based Gumbel Search

To address high-dimensional continuous action spaces, EZ-V2 proposes a novel sampling-based Gumbel search for action planning. This method significantly enhances exploration and guarantees policy improvement, even with limited simulation budgets. Consequently, the required number of simulations is substantially reduced, making the algorithm computationally efficient.
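The sketch below illustrates the general idea behind Gumbel-based root action selection with a sequential-halving budget: draw candidates via the Gumbel-Top-k trick, then progressively concentrate simulations on the most promising ones. The helper names (`policy_logp`, `q_fn`) are placeholders, the monotone Q-value transform used in Gumbel MuZero is omitted, and in EZ-V2 the candidates would be continuous actions sampled from the current policy rather than an enumerated discrete set.

```python
import numpy as np

def gumbel_search_root(policy_logp, q_fn, n_candidates=16, budget=32, rng=None):
    """Sketch of root action selection via Gumbel-Top-k sampling + sequential halving.
    policy_logp(a): log-probability of candidate a under the current policy prior.
    q_fn(a, n_sims): empirical Q estimate of a after n_sims tree simulations."""
    rng = rng or np.random.default_rng()
    actions = list(range(n_candidates))                # stand-ins for sampled candidate actions
    logp = np.asarray([policy_logp(a) for a in actions])
    gumbel = rng.gumbel(size=n_candidates)             # Gumbel noise -> sampling without replacement
    candidates = list(np.argsort(-(gumbel + logp)))    # start from the highest-scoring candidates

    q = np.zeros(n_candidates)
    n_phases = max(1, int(np.ceil(np.log2(n_candidates))))
    # Sequential halving: split the simulation budget across phases, keep the best half each time
    while len(candidates) > 1:
        sims_each = max(1, budget // (n_phases * len(candidates)))
        for a in candidates:
            q[a] = q_fn(actions[a], sims_each)
        candidates.sort(key=lambda a: -(gumbel[a] + logp[a] + q[a]))
        candidates = candidates[: max(1, len(candidates) // 2)]
    return actions[candidates[0]]
```

Because the budget is spent only on surviving candidates, this scheme remains informative even with a handful of simulations per move, which is what makes small simulation counts viable.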

Search-Based Value Estimation

EZ-V2 introduces a search-based value estimation method that uses the latest policy and model to compute more accurate value targets from imagined trajectories. This method, termed Search-Based Value Estimation (SVE), mitigates the off-policy issues associated with stale, early-stage transitions in the replay buffer. By combining the latest policy with multi-step TD targets, SVE provides a more reliable value estimation scheme and robust performance improvements.
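Schematically, such a target can be written as below; the notation is assumed here for illustration and is not lifted from the paper.

```latex
% Schematic search-based value target. For a replayed state s_t, the rewards \hat{r}
% and the bootstrap value \hat{v} come from an imagined rollout of horizon H generated
% with the *current* model and search policy, which is what counteracts the staleness
% of old transitions stored in the buffer.
V^{\mathrm{SVE}}(s_t) \approx
  \mathbb{E}\!\left[\, \sum_{i=0}^{H-1} \gamma^{i}\,\hat{r}_{t+i} \;+\; \gamma^{H}\,\hat{v}_{t+H} \right]
```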

Action Embedding and Gaussian Policy

EZ-V2 encodes actions into a compact latent space through action embeddings, yielding an efficient action representation for planning. Coupled with a Gaussian policy whose parameters are predicted by the learned policy network, this design balances exploration and exploitation and further improves planning efficiency.
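A minimal sketch of these two pieces is shown below, assuming a small embedding MLP and a tanh-squashed Gaussian head; the layer sizes, clamping range, and squashing choice are illustrative assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn

class GaussianPolicyWithActionEmbedding(nn.Module):
    """Illustrative sketch: action-embedding MLP plus a squashed Gaussian policy head."""
    def __init__(self, latent_dim: int, action_dim: int, embed_dim: int = 64):
        super().__init__()
        # Map raw continuous actions into a compact embedding consumed by the dynamics model
        self.action_embed = nn.Sequential(nn.Linear(action_dim, embed_dim), nn.ReLU())
        # Gaussian policy head: predicts mean and log-std from the latent state
        self.mu = nn.Linear(latent_dim, action_dim)
        self.log_std = nn.Linear(latent_dim, action_dim)

    def embed(self, action):
        return self.action_embed(action)

    def sample(self, latent_state):
        mu = self.mu(latent_state)
        std = self.log_std(latent_state).clamp(-5, 2).exp()
        dist = torch.distributions.Normal(mu, std)
        raw = dist.rsample()      # reparameterized sample for low-variance gradients
        return torch.tanh(raw)    # squash into a bounded action range
```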

Experimental Outcomes

Performance on Atari 100k

EZ-V2 exhibits outstanding performance on the Atari 100k benchmark, achieving a normalized mean score of 2.428 and a median score of 1.286, thus surpassing EfficientZero and BBF. A comprehensive set of experiments demonstrates that EZ-V2 outperforms DreamerV3 on 50 out of 66 tasks across various benchmarks, establishing a new SOTA in multiple domains.

Robustness in Proprio and Vision Control

In continuous control settings, EZ-V2 was evaluated on the Proprio Control and Vision Control benchmarks, which together cover tasks with different observation complexities and action spaces. In particular, EZ-V2 achieves a mean score of 723.2 on the Proprio Control benchmark and 726.1 on Vision Control tasks, significantly outperforming prior top-performing methods such as TD-MPC2 and DreamerV3. These results underscore EZ-V2's ability to generalize and to maintain high sample efficiency across disparate RL environments.

Implications and Future Directions

The theoretical and empirical advances demonstrated by EZ-V2 have implications in both practical and theoretical realms. Practically, the significant reduction in interaction data required for training opens up real-world applications in which data collection is expensive or hazardous, such as robotics and autonomous driving. Theoretically, the search-based value estimation and sampling-based Gumbel search provide fertile ground for further exploration and optimization of planning algorithms within model-based RL frameworks.

However, future research needs to address the challenges of integrating safety and risk considerations in real-world scenarios, particularly those involving stochastic dynamics and real-time decision-making constraints.

Conclusion

EfficientZero V2 (EZ-V2) successfully transcends the limitations of prior RL algorithms by introducing enhanced planning and value estimation techniques. Through careful consideration of computational efficiency alongside superior policy and value improvements, EZ-V2 sets a new benchmark in sample efficiency for diverse RL tasks. Future work will focus on scaling and verifying these advancements in broader real-world applications while integrating safety mechanisms for practical deployment.