Papers
Topics
Authors
Recent
2000 character limit reached

Temporal Second Difference Traces

Published 24 Apr 2011 in cs.LG | (1104.4664v1)

Abstract: Q-learning is a reliable but inefficient off-policy temporal-difference method, backing up reward only one step at a time. Replacing traces, using a recency heuristic, are more efficient but less reliable. In this work, we introduce model-free, off-policy temporal difference methods that make better use of experience than Watkins' Q(\lambda). We introduce both Optimistic Q(\lambda) and the temporal second difference trace (TSDT). TSDT is particularly powerful in deterministic domains. TSDT uses neither recency nor frequency heuristics, storing (s,a,r,s',\delta) so that off-policy updates can be performed after apparently suboptimal actions have been taken. There are additional advantages when using state abstraction, as in MAXQ. We demonstrate that TSDT does significantly better than both Q-learning and Watkins' Q(\lambda) in a deterministic cliff-walking domain. Results in a noisy cliff-walking domain are less advantageous for TSDT, but demonstrate the efficacy of Optimistic Q(\lambda), a replacing trace with some of the advantages of TSDT.

Citations (1)

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Authors (1)

Collections

Sign up for free to add this paper to one or more collections.