- The paper introduces ACKTR, a novel approach that employs Kronecker-factored approximation to streamline natural gradient computation in deep reinforcement learning.
- It reduces the computational overhead of natural gradient updates, making trust-region-style optimization practical for the larger neural network architectures common in deep RL.
- Experimental evaluations demonstrate that ACKTR outperforms TRPO and other baselines in sample efficiency and policy performance across diverse benchmark environments.
Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation
This paper presents a novel approach to improving the scalability of trust-region methods in deep reinforcement learning (DRL), leveraging a Kronecker-factored approximation of the curvature to make natural gradient optimization tractable at scale. The method, Actor Critic using Kronecker-Factored Trust Region (ACKTR), is proposed as an improvement over existing trust-region methods such as Trust Region Policy Optimization (TRPO) by integrating natural gradient updates with a scalable, per-layer computation strategy.
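Conceptually, each ACKTR update is a natural gradient step whose length is rescaled so that the (approximate) KL divergence between the old and new policies stays within a trust region. The toy sketch below illustrates that rescaling with a small dense Fisher matrix; the function name, the damping constant, and the values of the KL radius `delta` and cap `eta_max` are illustrative assumptions, and a real implementation never materializes the full Fisher matrix (ACKTR approximates it per layer with K-FAC, as described next).

```python
import numpy as np

def trust_region_natural_step(grad, fisher, delta=1e-3, eta_max=0.25):
    """Toy sketch: natural gradient step scaled to respect a KL trust region.

    grad   -- flattened policy gradient, shape (n,)
    fisher -- small dense approximation of the Fisher matrix, shape (n, n)
    delta, eta_max -- illustrative trust-region radius and step-size cap
    """
    # Natural gradient direction: F^{-1} g (here via a dense solve).
    nat_grad = np.linalg.solve(fisher, grad)

    # The quadratic model of the KL divergence for a step eta * nat_grad is
    # 0.5 * eta^2 * nat_grad^T F nat_grad; pick the largest eta that keeps
    # it below delta, capped at eta_max.
    quad = float(nat_grad @ fisher @ nat_grad)
    eta = min(eta_max, np.sqrt(2.0 * delta / (quad + 1e-8)))

    return eta * nat_grad
```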
Core Contributions
The authors extend natural gradient methods to the reinforcement learning setting in a way that significantly reduces computational overhead without compromising the quality of convergence. Traditional trust-region methods, while providing stable updates, incur high computational costs, largely because estimating and inverting the curvature of the large networks common in DRL is expensive (TRPO, for instance, relies on repeated Fisher-vector products inside a conjugate gradient solver at every update). ACKTR addresses this via:
- Kronecker-Factored Approximate Curvature (K-FAC): Applying K-FAC to DRL yields computational savings by approximating each layer's block of the Fisher information matrix (FIM) as a Kronecker product of two much smaller matrices, which streamlines the natural gradient computation (a minimal per-layer sketch follows this list).
- Improved Scalability: By employing Kronecker-factored approximations, ACKTR can work with larger neural network architectures than approaches that must store, invert, or repeatedly multiply by the full FIM. This is critical because it lets DRL practitioners apply trust-region updates at scale.
- Integration with Actor-Critic Methods: The approach is applied to actor-critic algorithms, with the Kronecker-factored natural gradient used to update both the policy (actor) and the value function (critic).
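To make the factorization concrete, the NumPy sketch below computes a single fully connected layer's natural gradient from its two small Kronecker factors instead of the full Fisher block. The helper name, the damping term, and the use of plain batch averages are illustrative assumptions; the paper's actual implementation additionally maintains running averages of the factors and handles convolutional layers.

```python
import numpy as np

def kfac_layer_natural_grad(grad_W, acts, pre_act_grads, damping=1e-2):
    """Sketch of a per-layer K-FAC natural gradient for a dense layer s = W a.

    grad_W        -- gradient of the loss w.r.t. W, shape (out_dim, in_dim)
    acts          -- layer inputs a for a batch, shape (batch, in_dim)
    pre_act_grads -- back-propagated gradients w.r.t. the pre-activations s,
                     shape (batch, out_dim)
    damping       -- illustrative Tikhonov damping to keep the factors invertible
    """
    batch = acts.shape[0]

    # Kronecker factors: A = E[a a^T], S = E[g g^T].
    A = acts.T @ acts / batch                      # (in_dim, in_dim)
    S = pre_act_grads.T @ pre_act_grads / batch    # (out_dim, out_dim)
    A += damping * np.eye(A.shape[0])
    S += damping * np.eye(S.shape[0])

    # With F ~= A (Kronecker) S, the natural gradient F^{-1} vec(grad_W)
    # reshapes to S^{-1} grad_W A^{-1}: only two small matrices are inverted.
    return np.linalg.solve(S, grad_W) @ np.linalg.inv(A)
```

Inverting the two factors costs roughly O(in_dim^3 + out_dim^3) rather than the O((in_dim * out_dim)^3) a full per-layer Fisher block would require, which is where the scalability described above comes from.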
Experimental Evaluation
The paper provides a comprehensive evaluation of the proposed method across several benchmark environments, including Atari games and MuJoCo continuous-control tasks, comparing it against conventional methods. Notably:
- ACKTR consistently demonstrates superior performance in terms of both sample efficiency and final policy quality across tasks compared to TRPO and other baseline algorithms.
- Its implementation shows particular promise in settings that demand large networks, such as learning control directly from high-dimensional pixel observations, where traditional trust-region methods become computationally burdensome.
Implications and Future Work
This research has significant implications for the field of DRL. By decreasing the computational demands associated with natural gradient computations, ACKTR offers a pathway to leveraging deeper and wider networks more effectively. Furthermore, it lays a foundation for future exploration into scaling trust-region methods, potentially inspiring advancements beyond strictly actor-critic frameworks.
Future directions may involve incorporating variance reduction techniques or combining the K-FAC scheme with other efficiency-oriented strategies for DRL optimization. Extending the approach to more complex, real-world scenarios where computational resources are a limiting factor would also be a valuable continuation of this work. Overall, the paper represents a significant step toward more computationally efficient, scalable reinforcement learning methods.