
EfficientZero V2: Mastering Discrete and Continuous Control with Limited Data (2403.00564v2)

Published 1 Mar 2024 in cs.LG, cs.AI, and cs.RO

Abstract: Sample efficiency remains a crucial challenge in applying Reinforcement Learning (RL) to real-world tasks. While recent algorithms have made significant strides in improving sample efficiency, none have achieved consistently superior performance across diverse domains. In this paper, we introduce EfficientZero V2, a general framework designed for sample-efficient RL algorithms. We have expanded the performance of EfficientZero to multiple domains, encompassing both continuous and discrete actions, as well as visual and low-dimensional inputs. With a series of improvements we propose, EfficientZero V2 outperforms the current state-of-the-art (SOTA) by a significant margin in diverse tasks under the limited data setting. EfficientZero V2 exhibits a notable advancement over the prevailing general algorithm, DreamerV3, achieving superior outcomes in 50 of 66 evaluated tasks across diverse benchmarks, such as Atari 100k, Proprio Control, and Vision Control.

Authors (5)
  1. Shengjie Wang (29 papers)
  2. Shaohuai Liu (5 papers)
  3. Weirui Ye (9 papers)
  4. Jiacheng You (12 papers)
  5. Yang Gao (762 papers)

Summary

  • The paper introduces EfficientZero V2, a novel RL algorithm that integrates sampling-based Gumbel search and search-based value estimation to drastically reduce simulation needs.
  • It demonstrates superior performance over previous methods on benchmarks like Atari 100k, Proprio Control, and Vision Control with significant score improvements.
  • The method’s robust architecture efficiently handles both discrete and continuous control tasks, paving the way for practical applications in robotics and autonomous systems.

EfficientZero V2: Mastering Discrete and Continuous Control with Limited Data

Introduction

EfficientZero V2 (EZ-V2) presents a substantial advancement in sample-efficient Reinforcement Learning (RL). While traditional RL methods perform well when data is abundant, they translate poorly to practical applications because of their massive data requirements. By leveraging a series of new techniques, EZ-V2 achieves superior performance across a diverse set of domains, including discrete and continuous control and tasks with varying observation complexities. Specifically, EZ-V2 surpasses the previous state of the art (SOTA) by a significant margin on benchmarks such as Atari 100k, Proprio Control, and Vision Control, while operating under a strict budget of environment interactions.

Key Contributions

General Framework for Sample Efficient RL

The EZ-V2 framework integrates several core components to achieve high sample efficiency for both discrete and continuous action spaces, and for both visual and low-dimensional inputs. Unlike EfficientZero, EZ-V2 adopts a Gumbel search for policy improvement, enabling effective planning with far fewer simulations. The same framework is applied across all evaluated domains and yields consistent performance gains; a structural sketch of its building blocks follows.
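As a rough orientation, the sketch below shows the MuZero/EfficientZero-style building blocks that methods in this family share: a representation network, a latent dynamics model with a reward head, and policy/value heads queried during search. The class, layer sizes, and method names are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Minimal sketch of a MuZero/EfficientZero-style model (names and sizes assumed)."""
    def __init__(self, obs_dim: int, action_dim: int, latent_dim: int = 128):
        super().__init__()
        # h: encode a raw observation (visual or low-dimensional) into a latent state
        self.representation = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.ReLU())
        # g: predict the next latent state and reward from (latent state, action)
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + action_dim, latent_dim), nn.ReLU())
        self.reward_head = nn.Linear(latent_dim, 1)
        # f: policy and value heads used as priors / leaf estimates during search
        self.policy_head = nn.Linear(latent_dim, action_dim)
        self.value_head = nn.Linear(latent_dim, 1)

    def initial_inference(self, obs):
        s = self.representation(obs)
        return s, self.policy_head(s), self.value_head(s)

    def recurrent_inference(self, s, a):
        s_next = self.dynamics(torch.cat([s, a], dim=-1))
        return s_next, self.reward_head(s_next), self.policy_head(s_next), self.value_head(s_next)
```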

Enhanced Planning via Sampling-Based Gumbel Search

To address high-dimensional continuous action spaces, EZ-V2 proposes a novel sampling-based Gumbel search for action planning. This method significantly enhances exploration and guarantees policy improvement, even with limited simulation budgets. Consequently, the required number of simulations is substantially reduced, making the algorithm computationally efficient.
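The sketch below illustrates the general idea behind Gumbel-based root action selection with a sequential-halving budget: draw candidates via the Gumbel-Top-k trick, then progressively concentrate simulations on the most promising ones. The helper names (`policy_logp`, `q_fn`) are placeholders, the monotone Q-value transform used in Gumbel MuZero is omitted, and in EZ-V2 the candidates would be continuous actions sampled from the current policy rather than an enumerated discrete set.

```python
import numpy as np

def gumbel_search_root(policy_logp, q_fn, n_candidates=16, budget=32, rng=None):
    """Sketch of root action selection via Gumbel-Top-k sampling + sequential halving.
    policy_logp(a): log-probability of candidate a under the current policy prior.
    q_fn(a, n_sims): empirical Q estimate of a after n_sims tree simulations."""
    rng = rng or np.random.default_rng()
    actions = list(range(n_candidates))                # stand-ins for sampled candidate actions
    logp = np.asarray([policy_logp(a) for a in actions])
    gumbel = rng.gumbel(size=n_candidates)             # Gumbel noise -> sampling without replacement
    candidates = list(np.argsort(-(gumbel + logp)))    # start from the highest-scoring candidates

    q = np.zeros(n_candidates)
    n_phases = max(1, int(np.ceil(np.log2(n_candidates))))
    # Sequential halving: split the simulation budget across phases, keep the best half each time
    while len(candidates) > 1:
        sims_each = max(1, budget // (n_phases * len(candidates)))
        for a in candidates:
            q[a] = q_fn(actions[a], sims_each)
        candidates.sort(key=lambda a: -(gumbel[a] + logp[a] + q[a]))
        candidates = candidates[: max(1, len(candidates) // 2)]
    return actions[candidates[0]]
```

Because the budget is spent only on surviving candidates, this scheme remains informative even with a handful of simulations per move, which is what makes small simulation counts viable.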

Search-Based Value Estimation

EZ-V2 introduces a search-based value estimation method that uses the latest policy and model to compute more accurate value targets from imagined trajectories. This method, termed Search-Based Value Estimation (SVE), mitigates the off-policy issues associated with stale, early-stage transitions in the replay buffer. By combining the latest policy with multi-step TD targets, SVE provides a more reliable value estimation scheme and robust performance improvements.
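Schematically, such a target can be written as below; the notation is assumed here for illustration and is not lifted from the paper.

```latex
% Schematic search-based value target. For a replayed state s_t, the rewards \hat{r}
% and the bootstrap value \hat{v} come from an imagined rollout of horizon H generated
% with the *current* model and search policy, which is what counteracts the staleness
% of old transitions stored in the buffer.
V^{\mathrm{SVE}}(s_t) \approx
  \mathbb{E}\!\left[\, \sum_{i=0}^{H-1} \gamma^{i}\,\hat{r}_{t+i} \;+\; \gamma^{H}\,\hat{v}_{t+H} \right]
```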

Action Embedding and Gaussian Policy

EZ-V2 encodes actions into a compact latent space through action embeddings, yielding an efficient action representation for planning. Coupled with a Gaussian policy whose parameters are predicted by the learned policy network, this design balances exploration and exploitation and further improves planning efficiency.
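A minimal sketch of these two pieces is shown below, assuming a small embedding MLP and a tanh-squashed Gaussian head; the layer sizes, clamping range, and squashing choice are illustrative assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn

class GaussianPolicyWithActionEmbedding(nn.Module):
    """Illustrative sketch: action-embedding MLP plus a squashed Gaussian policy head."""
    def __init__(self, latent_dim: int, action_dim: int, embed_dim: int = 64):
        super().__init__()
        # Map raw continuous actions into a compact embedding consumed by the dynamics model
        self.action_embed = nn.Sequential(nn.Linear(action_dim, embed_dim), nn.ReLU())
        # Gaussian policy head: predicts mean and log-std from the latent state
        self.mu = nn.Linear(latent_dim, action_dim)
        self.log_std = nn.Linear(latent_dim, action_dim)

    def embed(self, action):
        return self.action_embed(action)

    def sample(self, latent_state):
        mu = self.mu(latent_state)
        std = self.log_std(latent_state).clamp(-5, 2).exp()
        dist = torch.distributions.Normal(mu, std)
        raw = dist.rsample()      # reparameterized sample for low-variance gradients
        return torch.tanh(raw)    # squash into a bounded action range
```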

Experimental Outcomes

Performance on Atari 100k

EZ-V2 exhibits outstanding performance on the Atari 100k benchmark, achieving a normalized mean score of 2.428 and a median score of 1.286, thus surpassing EfficientZero and BBF. A comprehensive set of experiments demonstrates that EZ-V2 outperforms DreamerV3 on 50 out of 66 tasks across various benchmarks, establishing a new SOTA in multiple domains.

Robustness in Proprio and Vision Control

In continuous control settings, EZ-V2 was evaluated on the Proprio Control and Vision Control benchmarks, which together cover tasks with different observation complexities and action spaces. In particular, EZ-V2 achieves a mean score of 723.2 on the Proprio Control benchmark and 726.1 on Vision Control tasks, significantly outperforming prior top-performing methods such as TD-MPC2 and DreamerV3. These results underscore EZ-V2's ability to generalize and to maintain high sample efficiency across disparate RL environments.

Implications and Future Directions

The theoretical and empirical advances demonstrated by EZ-V2 have implications in both practical and theoretical realms. Practically, the significant reduction in interaction data required for training opens up real-world applications in which data collection is expensive or hazardous, such as robotics and autonomous driving. Theoretically, the search-based value estimation and sampling-based Gumbel search provide fertile ground for further exploration and optimization of planning algorithms within model-based RL frameworks.

However, future research needs to address the challenges of integrating safety and risk considerations in real-world scenarios, particularly those involving stochastic dynamics and real-time decision-making constraints.

Conclusion

EfficientZero V2 (EZ-V2) successfully transcends the limitations of prior RL algorithms by introducing enhanced planning and value estimation techniques. Through careful consideration of computational efficiency alongside superior policy and value improvements, EZ-V2 sets a new benchmark in sample efficiency for diverse RL tasks. Future work will focus on scaling and verifying these advancements in broader real-world applications while integrating safety mechanisms for practical deployment.