Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping
The paper "Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping" by Sutton, Szepesvári, Geramifard, and Bowling addresses a significant challenge in reinforcement learning: efficiently learning optimal control policies and value functions in large state spaces under online settings. This paper extends the Dyna architecture to incorporate linear function approximation, a move that enables the handling of more extensive problems than a tabular representation would permit.
The research presents a model-based approach in which linear Dyna-style planning generates imagined experience from a learned linear model and applies model-free reinforcement learning updates to these imagined transitions. A notable contribution is a proof that Dyna-style planning with linear function approximation converges to a unique fixed point that does not depend on the distribution used to generate feature vectors during planning. In the policy-evaluation setting, this convergence limit aligns with the least-squares temporal-difference (LSTD) solution.
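To make the planning loop concrete, here is a minimal sketch of one planning sweep in the policy-evaluation setting, written in Python. It assumes the linear-Dyna ingredients described above: a learned model (F, b) in which F @ phi approximates the expected next feature vector and b @ phi the expected reward. The function and variable names are illustrative, not the paper's.

```python
import numpy as np

def linear_dyna_planning_sweep(theta, F, b, gamma, alpha, sample_phi, num_updates):
    """One sweep of linear Dyna-style planning for policy evaluation (sketch).

    theta      : (n,) value-function weights, V(s) ~ theta @ phi(s)
    F, b       : learned linear model; F @ phi predicts the expected next
                 feature vector and b @ phi the expected reward
    sample_phi : callable returning an (n,) feature vector to plan from
                 (the choice of this distribution does not change the fixed point)
    """
    for _ in range(num_updates):
        phi = sample_phi()                      # imagined starting features
        next_phi = F @ phi                      # model-predicted next features
        reward = b @ phi                        # model-predicted expected reward
        delta = reward + gamma * theta @ next_phi - theta @ phi  # TD(0) error
        theta = theta + alpha * delta * phi     # model-free update on the imagined step
    return theta
```

In a full agent, sweeps of this kind would be interleaved with real environment steps that update F, b, and theta from observed transitions.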
An important practical implication is that prioritized sweeping, traditionally confined to tabular settings, can be reliably extended to linear approximation. The authors introduce two versions of prioritized sweeping for linear Dyna and evaluate them empirically on the Mountain Car and Boyan Chain problems; a rough sketch of the prioritization idea follows.
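As an illustration of how prioritization can be carried over to the linear setting, the sketch below maintains a priority queue over individual feature indices, plans from unit feature vectors, and re-prioritizes predecessor features in proportion to |F[i, j] * delta|. This is a simplified, assumption-laden rendering of the general idea, not a verbatim reproduction of either Dyna-PWMA or Dyna-MG.

```python
import heapq
import numpy as np

def prioritized_linear_dyna(theta, F, b, gamma, alpha, priorities, num_updates,
                            threshold=1e-6):
    """Sketch of prioritized sweeping over feature indices for linear Dyna.

    priorities : (n,) initial urgencies for planning from each unit feature e_i.
    Predecessor features j of an updated feature i (those with F[i, j] != 0)
    are pushed back onto the queue with priority |F[i, j] * delta|.
    """
    n = len(theta)
    heap = [(-p, i) for i, p in enumerate(priorities) if p > threshold]
    heapq.heapify(heap)                          # max-priority queue via negated keys
    for _ in range(num_updates):
        if not heap:
            break
        _, i = heapq.heappop(heap)
        phi = np.zeros(n)
        phi[i] = 1.0                             # plan from the unit feature e_i
        delta = b @ phi + gamma * theta @ (F @ phi) - theta @ phi
        theta = theta + alpha * delta * phi
        for j in np.flatnonzero(F[i]):           # features whose predictions lead into i
            p = abs(F[i, j] * delta)
            if p > threshold:
                heapq.heappush(heap, (-p, int(j)))
    return theta
```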
Convergence and Theoretical Insights
The paper's theoretical section begins by delineating conditions under which linear Dyna planning converges. One central finding is that convergence of the planning iteration, whether the updates are of the TD(0) or residual-gradient form, is unaffected by the feature-sampling distribution p, provided that p exercises all n dimensions of the feature space. What determines convergence is instead the structure of the matrix F that constitutes the linear model of the world. When the iteration converges, it reaches a deterministic fixed point at which the temporal-difference (TD) error is zero for every feature vector generated during planning.
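Written out in standard notation (assuming, as above, a model in which F\phi predicts the expected next feature vector and b^\top\phi the expected reward), the TD(0) planning iteration is

$$
\theta \leftarrow \theta + \alpha\,\delta\,\phi,
\qquad
\delta = b^\top\phi + \gamma\,\theta^\top F\phi - \theta^\top\phi .
$$

Taking expectations over $\phi \sim p$ with second-moment matrix $C = \mathbb{E}_p[\phi\phi^\top]$, the expected update direction is $\alpha\,C\,(b + \gamma F^\top\theta - \theta)$. Any full-rank $C$ therefore yields the same fixed point, $\theta^\ast = (I - \gamma F^\top)^{-1} b$ whenever this inverse exists, and at $\theta^\ast$ the TD error $\delta$ is zero for every feature vector $\phi$.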
For control, the paper examines convergence when planning with a separate linear model for each action, concluding that, when convergence is achieved, the iteration arrives at a unique solution. The conditions under which convergence to this fixed point occurs are closely tied to the existence of a solution to the underlying dynamic-programming problem.
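In the same notation, and assuming one learned linear model $(F_a, b_a)$ per action as described above, the greedy planning update in the control case takes a form along the lines of

$$
\theta \leftarrow \theta + \alpha\Big(\max_{a}\big[\,b_a^\top\phi + \gamma\,\theta^\top F_a\phi\,\big] - \theta^\top\phi\Big)\,\phi ,
$$

where the maximization over actions is what can prevent convergence in general; when the iteration does converge, the limit satisfies the corresponding fixed-point equation.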
Experimental Evaluations
The paper includes empirical evaluations on an extended Boyan Chain domain and on Mountain Car. The experiments compare four algorithms: Dyna-Random, Dyna-PWMA, Dyna-MG, and model-free TD(0). On the Boyan Chain, the linear Dyna methods outperformed TD(0), with the prioritized variants showing the best learning efficiency. Dyna-MG accelerated learning the most, although it initially exhibited higher variance because its model is estimated from only a few experiences early in learning.
In the Mountain Car domain, Dyna-MG showed a clear improvement over TD(0) in early learning speed, whereas the other Dyna variants did not surpass TD(0). Notably, although Dyna-MG performs several planning updates per real step, it still ran faster overall owing to the computational efficiency of those updates. Control experiments with Dyna-MG on Mountain Car showed learning rates comparable to model-free Sarsa, supporting the practicality of model-based planning in the control setting.
Implications and Future Prospects
This work shows how integrating linear function approximation into the Dyna architecture extends its applicability to larger and more complex domains. The convergence guarantees and empirical results underline Dyna's potential as a practical method for rapid adaptation in online RL settings. The prioritization techniques carried over to linear models promise improved learning efficiency and accuracy, and further research could refine these prioritization strategies and extend their use to diverse, real-world problems, supporting new developments in AI and RL.