Exploring Model-Based Planning with Policy Networks
This paper introduces a new model-based reinforcement learning (MBRL) algorithm, model-based policy planning (POPLIN), which aims to improve the sample efficiency and asymptotic performance of traditional MBRL techniques that rely on model-predictive control (MPC) or online planning, particularly in complex, high-dimensional environments. Traditional MBRL methods typically plan with random search over the action space, which becomes inefficient as the dimensionality of the task grows.
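For context, below is a minimal sketch of the kind of CEM-based action-space planner used by methods such as PETS; the `dynamics_model(state, action)` and `reward_fn(state, action)` callables are hypothetical placeholders standing in for whatever the MBRL agent has learned, and the hyperparameters are illustrative rather than taken from the paper.

```python
import numpy as np

def cem_plan_actions(state, dynamics_model, reward_fn,
                     horizon=30, pop_size=500, n_elites=50, n_iters=5,
                     action_dim=6):
    """Cross-entropy method (CEM) planning directly in action space.

    Samples candidate action sequences from a Gaussian, rolls them out
    through the learned dynamics model, and refits the Gaussian to the
    highest-return ("elite") candidates. Returns the first action of the
    final mean sequence (standard MPC: replan at every step).
    """
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))

    for _ in range(n_iters):
        # Sample a population of action sequences.
        candidates = mean + std * np.random.randn(pop_size, horizon, action_dim)
        returns = np.zeros(pop_size)
        for i in range(pop_size):
            s, total = state, 0.0
            for t in range(horizon):
                a = np.clip(candidates[i, t], -1.0, 1.0)
                total += reward_fn(s, a)
                s = dynamics_model(s, a)  # predicted next state
            returns[i] = total
        # Refit the sampling distribution to the elite sequences.
        elites = candidates[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6

    return mean[0]  # execute only the first planned action
```

The search space here scales with `horizon * action_dim`, which is precisely the bottleneck POPLIN targets.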
The authors propose combining policy networks with online planning to improve sample efficiency. The key innovation of POPLIN is to frame action planning at each time step as an optimization problem and to explore two approaches: optimizing action sequences initialized from a policy network, or optimizing directly over the parameters of the policy network. This departs from random-search planning and addresses its poor scaling as task complexity increases.
The paper reports that POPLIN achieves state-of-the-art results on several MuJoCo benchmark environments, with roughly a threefold improvement in sample efficiency over prior state-of-the-art algorithms such as PETS, TD3, and SAC. The authors attribute this to the smoother optimization surface in parameter space compared to action space. They also show that the distilled policy network can be deployed without computationally expensive MPC at test time in environments such as Cheetah.
The paper outlines three principal contributions:
- Using policy networks to generate proposals for MPC in high-dimensional locomotion control problems with previously unknown dynamics.
- Reformulating planning as an optimization problem over the policy network's parameters (policy planning in parameter space), achieving substantial improvements over existing methods on benchmark environments.
- Exploring policy network distillation from planned trajectories, where the distilled network performs competently in environments like Cheetah without requiring online planning (a distillation sketch follows this list).
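As a rough illustration of the third contribution, the distillation step can be read as behavior cloning: fit the policy network to the (state, planned action) pairs produced while running MPC. The sketch below assumes a generic PyTorch policy network and a plain regression loop; it is an assumption about how such a step could be implemented, not the authors' exact training procedure.

```python
import torch
import torch.nn as nn

def distill_policy(policy_net, states, planned_actions, epochs=50, lr=1e-3):
    """Behavior-cloning style distillation of planned trajectories.

    `states` and `planned_actions` are tensors of shape (N, state_dim)
    and (N, action_dim) collected while running the MPC planner. After
    training, `policy_net(state)` can be used directly at test time
    without any online planning.
    """
    optimizer = torch.optim.Adam(policy_net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(policy_net(states), planned_actions)
        loss.backward()
        optimizer.step()
    return policy_net
```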
The POPLIN algorithm comes in two variants: model-based policy planning in action space (POPLIN-A) and model-based policy planning in parameter space (POPLIN-P). POPLIN-A uses the policy network to propose initial action sequences and refines them with the cross-entropy method (CEM), while POPLIN-P applies CEM to noise added in the policy network's parameter space. The latter variant performs better in complex environments because it searches over a smoother optimization surface.
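To make the contrast concrete, the sketch below shows the parameter-space idea in spirit: instead of perturbing action sequences, CEM perturbs a flattened copy of the policy weights and scores each perturbed policy by rolling it out through the learned model. The names `policy_fn`, `flat_params`, `dynamics_model`, and `reward_fn` are illustrative placeholders, not the released implementation.

```python
import numpy as np

def poplin_p_style_plan(state, policy_fn, flat_params, dynamics_model, reward_fn,
                        horizon=30, pop_size=500, n_elites=50, n_iters=5,
                        noise_std=0.1):
    """CEM over additive noise in policy-parameter space (POPLIN-P in spirit).

    Each candidate is a noise vector added to the current policy weights;
    its score is the model-predicted return of running the perturbed
    policy for `horizon` steps. The elite noise vectors refit the search
    distribution, and the final mean perturbation's first action is executed.
    """
    dim = flat_params.size
    mean, std = np.zeros(dim), np.full(dim, noise_std)

    for _ in range(n_iters):
        noise = mean + std * np.random.randn(pop_size, dim)
        returns = np.zeros(pop_size)
        for i in range(pop_size):
            theta = flat_params + noise[i]
            s, total = state, 0.0
            for t in range(horizon):
                a = policy_fn(s, theta)   # action from the perturbed policy
                total += reward_fn(s, a)
                s = dynamics_model(s, a)
            returns[i] = total
        elites = noise[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6

    return policy_fn(state, flat_params + mean)  # first action under the planned policy
```

Because a small parameter perturbation changes the whole action sequence coherently, this search is, per the paper's argument, easier to optimize than independent perturbations of each action.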
From a theoretical standpoint, this work advances the discourse on using policy networks in conjunction with model-based planning, suggesting the potential scalability of POPLIN to more intricate tasks. Practically, the implications for real-time deployment and reduced reliance on extensive computational planning are significant, particularly for applications where environment dynamics are uncertain or costly to evaluate.
The authors suggest that their exploratory findings may open new avenues for incorporating richer policy network architectures and alternative optimization techniques. Future work could explore integrating POPLIN within broader AI control systems, particularly those that must balance operational efficiency with agility in high-dimensional decision-making. The release of their accompanying codebase also encourages further benchmarking and refinement by the community.