Dual Policy Iteration (1805.10755v2)

Published 28 May 2018 in cs.LG and stat.ML

Abstract: Recently, a novel class of Approximate Policy Iteration (API) algorithms have demonstrated impressive practical performance (e.g., ExIt from [2], AlphaGo-Zero from [27]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., Tree Search), that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using learned local models, and provides a theoretically sound way of unifying model-free and model-based RL approaches with unknown dynamics. We demonstrate the efficacy of our approach on various continuous control Markov Decision Processes.

Citations (54)

Summary

Dual Policy Iteration and its Implications in Reinforcement Learning

The paper "Dual Policy Iteration" explores a class of Approximate Policy Iteration (API) algorithms that aim to improve sample efficiency in Reinforcement Learning (RL) through the integration of model-free and model-based approaches. The primary focus is on Dual Policy Iteration (DPI), a strategy employing two policies: a fast, reactive policy for testing and a slow, non-reactive policy that guides the reactive policy during training. The paper provides both theoretical insights and practical algorithms to demonstrate DPI's efficacy across varied control tasks.

Overview of Dual Policy Iteration

Dual Policy Iteration is a novel approach within API algorithms that alternately optimizes two distinct types of policies. The reactive policy, typically implemented with function approximators such as neural networks, is cheap to evaluate and is the policy deployed at test time. Conversely, the non-reactive policy, which may involve expensive planning such as Tree Search, guides the reactive policy's learning through deeper, multi-step lookahead. This alternating structure is inspired by performance-difference results from earlier RL theory, with the non-reactive policy updated to maximize its advantage over the current reactive policy.
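
As a rough illustration of this alternating scheme, the following minimal Python sketch (my own pseudocode; the callables rollout, fit_local_dynamics, plan_with_models, and imitate are hypothetical placeholders, not the authors' implementation) shows how the two policies interact in one training loop.

def dual_policy_iteration(reactive_policy, rollout, fit_local_dynamics,
                          plan_with_models, imitate, num_iterations=100):
    for _ in range(num_iterations):
        # 1. Collect trajectories with the current fast, reactive policy.
        trajectories = rollout(reactive_policy)
        # 2. Fit local dynamics models around the visited states.
        local_models = fit_local_dynamics(trajectories)
        # 3. Improve the slow, non-reactive policy via model-based planning,
        #    constrained to stay close to the reactive policy.
        nonreactive_policy = plan_with_models(local_models, reactive_policy)
        # 4. Distill the non-reactive policy back into the reactive one
        #    (supervised / imitation-learning update, as in ExIt or AlphaGo-Zero).
        reactive_policy = imitate(reactive_policy, nonreactive_policy, trajectories)
    return reactive_policy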

Theoretical Contributions

The paper extends existing API theory by providing convergence guarantees for DPI. One significant theoretical contribution is an analysis of how DPI can achieve larger policy improvement per iteration than existing methods such as Conservative Policy Iteration (CPI) and Trust Region Policy Optimization (TRPO). The convergence analysis shows that DPI's per-iteration improvement combines the gains from each of the two optimization steps with additional gains from incorporating the learned dynamics models.
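
To make the advantage-based update concrete, one standard way to write the relevant quantities (notation mine, following common API conventions rather than the paper's exact statement; \pi denotes the reactive policy, \eta the non-reactive policy, d_\pi the discounted state distribution, A^\pi the advantage function, D a divergence, and \beta a trust-region threshold) is:

J(\eta) - J(\pi) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d_\eta,\, a \sim \eta(\cdot \mid s)}\left[ A^{\pi}(s, a) \right]

\eta_{\text{new}} \in \arg\max_{\eta}\; \mathbb{E}_{s \sim d_\pi,\, a \sim \eta(\cdot \mid s)}\left[ A^{\pi}(s, a) \right] \quad \text{s.t.} \quad D\left( d_\eta \,\|\, d_\pi \right) \le \beta

The first identity is the standard performance-difference lemma; the second expresses the non-reactive policy as an advantage maximizer under a trust-region-style constraint, consistent with the alternating, constrained structure described above.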

Moreover, the paper develops a constrained optimization setup in which the non-reactive policy is updated through model-based optimal control (MBOC) using learned local models. This yields a theoretically sound way to combine model-free and model-based RL when the dynamics are unknown. An essential ingredient is the use of trust regions, which stabilizes policy updates.
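
As a concrete, generic illustration of "learned local models", the snippet below fits a local linear dynamics model s' ≈ A s + B a + c by least squares from on-policy transitions; this is an illustrative sketch under my own assumptions, not the paper's estimator.

import numpy as np

def fit_local_linear_dynamics(states, actions, next_states):
    # states: (N, ds), actions: (N, da), next_states: (N, ds)
    X = np.hstack([states, actions, np.ones((states.shape[0], 1))])
    # Solve min_W ||X W - next_states||^2, with W stacking [A^T; B^T; c^T].
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    ds, da = states.shape[1], actions.shape[1]
    A, B, c = W[:ds].T, W[ds:ds + da].T, W[-1]
    return A, B, c

In a GPS-style instantiation, such local models would be fit around the current trajectory distribution and then used by a model-based planner (for example iLQG) whose solution is kept within a trust region of the reactive policy.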

Practical Implications and Results

To evaluate DPI, the authors conducted experiments on both discrete and continuous control tasks, including synthetic MDPs, Cart-Pole, Helicopter Aerobatics, and various locomotion tasks from the MuJoCo simulator. The results consistently show that DPI and its instantiated algorithms (such as AggreVaTeD-GPS) learn more sample-efficiently than classic CPI methods and actor-critic baselines (TRPO-GAE). The model-based search yields significant gains in sample efficiency, highlighting the benefit of integrating learned local dynamics into the policy improvement procedure.

The paper also explores robust policy optimization settings, showcasing the utility of DPI in environments requiring adaptivity across dynamic conditions. This adaptability underscores DPI's potential in real-world applications where model dynamics are partially unknown or vary significantly.

Future Directions

The research opens avenues for further improvements in RL algorithms, such as learning more expressive dynamics and refining the local linear models to reduce approximation error. Continued exploration of DPI in large-scale settings, such as robotics or complex strategic games, may also yield deeper insight into systematic exploration and into how policies can be learned in environments with vast state and action spaces.

In conclusion, the paper "Dual Policy Iteration" represents a meaningful contribution to RL research, presenting an innovative framework that integrates structured exploration via model-based planning into conventional policy iteration strategies. This approach not only aligns closely with successful practical algorithms like AlphaGo-Zero but also sets a foundation for improved RL methods with faster convergence and greater applicability in diverse control tasks.
