
Discovered Policy Optimisation (2210.05639v2)

Published 11 Oct 2022 in cs.LG and cs.AI

Abstract: Tremendous progress has been made in reinforcement learning (RL) over the past decade. Most of these advancements came through the continual development of new algorithms, which were designed using a combination of mathematical derivations, intuitions, and experimentation. Such an approach of creating algorithms manually is limited by human understanding and ingenuity. In contrast, meta-learning provides a toolkit for automatic machine learning method optimisation, potentially addressing this flaw. However, black-box approaches which attempt to discover RL algorithms with minimal prior structure have thus far not outperformed existing hand-crafted algorithms. Mirror Learning, which includes RL algorithms, such as PPO, offers a potential middle-ground starting point: while every method in this framework comes with theoretical guarantees, components that differentiate them are subject to design. In this paper we explore the Mirror Learning space by meta-learning a "drift" function. We refer to the immediate result as Learnt Policy Optimisation (LPO). By analysing LPO we gain original insights into policy optimisation which we use to formulate a novel, closed-form RL algorithm, Discovered Policy Optimisation (DPO). Our experiments in Brax environments confirm state-of-the-art performance of LPO and DPO, as well as their transfer to unseen settings.

Citations (68)

Summary

  • The paper introduces DPO, a closed-form RL algorithm derived from Learnt Policy Optimisation (LPO) that reduces brittleness to hyperparameter settings.
  • The methodology employs Evolution Strategies to train a drift function within the Mirror Learning framework, ensuring robust convergence and enhanced exploration.
  • Empirical results demonstrate that DPO matches LPO’s performance while outperforming PPO, highlighting the promise of meta-learned RL algorithms.

Overview of Discovered Policy Optimisation

This paper, "Discovered Policy Optimisation," presents a novel approach to reinforcing learning (RL) algorithm development through a combination of theoretical frameworks and computational techniques. The research introduces a new RL algorithm, Learnt Policy Optimisation (LPO), derived through meta-learning within a restricted space of Mirror Learning algorithms. This approach addresses the limitations inherent in manually crafted algorithms, such as brittleness to hyperparameter settings and a lack of robustness guarantees. The paper also proposes a closed-form approximation of LPO named Discovered Policy Optimisation (DPO), which replicates the central features discovered by LPO.

Discovered Policy Optimisation: Methodology and Results

The primary innovation of this research lies in the exploration of the Mirror Learning framework through meta-learning, which results in the derivation of LPO and subsequently DPO. Meta-learning is employed to automatically discover the drift function within the Mirror Learning space; because every valid drift function yields a Mirror Learning method, the resulting algorithm inherits the framework's theoretical guarantees of convergence to optimal policies.
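
To make the role of the drift function concrete, the following sketch (an illustration written for this summary, not code from the paper) expresses a Mirror-Learning-style surrogate as the usual ratio-weighted advantage minus a non-negative drift penalty, and shows how one particular choice of drift recovers PPO's clipped objective. The names `ratio`, `advantage`, and `epsilon` are assumptions of this sketch.

```python
import numpy as np

def ppo_clip_drift(ratio, advantage, epsilon=0.2):
    """A drift that recovers PPO's clipped objective:
    ratio * A - drift == min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return np.maximum(0.0, (ratio - clipped) * advantage)

def mirror_surrogate(ratio, advantage, drift_fn):
    """Mirror-Learning-style surrogate: ratio-weighted advantage
    minus a non-negative drift penalty on the policy update."""
    return np.mean(ratio * advantage - drift_fn(ratio, advantage))

# The drift only activates when the ratio moves in the direction that
# would otherwise inflate the objective beyond the clipping region.
ratio = np.array([0.7, 1.0, 1.3])
advantage = np.array([1.0, 1.0, 1.0])
print(mirror_surrogate(ratio, advantage, ppo_clip_drift))  # == mean(0.7, 1.0, 1.2)
```

Meta-learning LPO then amounts to replacing `ppo_clip_drift` with a small neural network over the (ratio, advantage) inputs and optimising its parameters for downstream training performance.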

To achieve this, the authors use Evolution Strategies (ES) to train the drift function network, which takes the probability ratio and advantage estimate as inputs. Meta-training is carried out across multiple environments to encourage generalisation and robustness. The learned drift exhibits two key features, a rollback mechanism for negative advantages and cautious optimism for positive advantages, both of which raise the policy's entropy and thereby encourage exploration. These features are carried over into the derived DPO algorithm.
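
A minimal sketch of an OpenAI-style ES outer loop, of the kind that could drive this meta-training, is shown below. It illustrates the general technique rather than the authors' implementation; `meta_fitness` is a hypothetical stand-in for training an agent with a candidate drift function and returning its final return.

```python
import numpy as np

def es_meta_train(theta, meta_fitness, iters=100, pop_size=32,
                  sigma=0.02, lr=0.01, seed=0):
    """Evolution Strategies on the flat drift-network parameter vector theta.

    meta_fitness is a hypothetical callable: it trains an RL agent using the
    drift defined by the given parameters and returns a scalar score.
    """
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        noise = rng.standard_normal((pop_size, theta.size))
        # Antithetic sampling: evaluate mirrored perturbations of theta.
        scores_pos = np.array([meta_fitness(theta + sigma * n) for n in noise])
        scores_neg = np.array([meta_fitness(theta - sigma * n) for n in noise])
        # Gradient estimate: fitness-weighted average of the perturbations.
        grad = noise.T @ (scores_pos - scores_neg) / (2 * pop_size * sigma)
        theta = theta + lr * grad
    return theta
```

Each fitness evaluation is itself a full (if short) RL training run, which is why fast, vectorised environments such as Brax are a natural fit for this kind of meta-training.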

The empirical results in Brax environments show that DPO matches the performance of its predecessor, LPO, and outperforms Proximal Policy Optimization (PPO), particularly in its ability to generalise across environments and hyperparameter settings. The success of DPO underscores the potential of meta-learned algorithms to discover novel strategies that exceed traditional handcrafted approaches in RL.
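
The qualitative shape described above, rollback for negative advantages and tempered optimism for positive ones, can be sketched as a two-case drift function. The code below is illustrative only: the functional form and constants are placeholders standing in for DPO's reported closed form, which should be taken from the paper itself.

```python
import numpy as np

def dpo_style_drift(ratio, advantage, alpha=2.0, beta=0.6):
    """Illustrative two-case drift in the spirit of DPO (constants are placeholders).

    For positive advantages the drift grows slowly as the ratio increases,
    tempering large optimistic updates ("cautious optimism"); for negative
    advantages it penalises pushing the ratio far below 1 ("rollback"),
    which keeps the policy's entropy higher than a hard clip would.
    """
    pos = (ratio - 1.0) * advantage          # active when A >= 0 and the ratio grows
    drift_pos = np.maximum(0.0, pos - alpha * np.tanh(pos / alpha))
    neg = np.log(ratio) * advantage          # active when A < 0 and the ratio shrinks
    drift_neg = np.maximum(0.0, neg - beta * np.tanh(neg / beta))
    return np.where(advantage >= 0.0, drift_pos, drift_neg)
```

Because a drift of this kind is smooth and closed-form, it could be dropped into a PPO-style implementation by replacing the clipped objective with `ratio * advantage - drift`.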

Implications and Future Directions

The implications of this research are twofold: it showcases the power of meta-learning constrained by theoretical frameworks such as Mirror Learning in discovering algorithms with guaranteed theoretical soundness, and it highlights the utility of computational advancements such as ES in driving algorithmic innovation. DPO itself is simple to implement while providing robust performance across a spectrum of environments.

This approach points to future research avenues in expanding the dimensionality of inputs to the drift function to include other algorithmic attributes and domain-specific parameters. Further exploration of other components within the Mirror Learning space using meta-learning techniques holds promise for additional improvements and discoveries in RL methodologies.

The paper excels by bridging the gap between theoretical guarantees and practical competence through intelligent algorithm discovery, suggesting a promising direction for future developments in AI and RL systems.
