Learning and Planning in Average-Reward Markov Decision Processes (2006.16318v3)

Published 29 Jun 2020 in cs.LG and cs.AI

Abstract: We introduce learning and planning algorithms for average-reward MDPs, including 1) the first general proven-convergent off-policy model-free control algorithm without reference states, 2) the first proven-convergent off-policy model-free prediction algorithm, and 3) the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset. All of our algorithms are based on using the temporal-difference error rather than the conventional error when updating the estimate of the average reward. Our proof techniques are a slight generalization of those by Abounadi, Bertsekas, and Borkar (2001). In experiments with an Access-Control Queuing Task, we show some of the difficulties that can arise when using methods that rely on reference states and argue that our new algorithms can be significantly easier to use.

Citations (52)

Summary

  • The paper introduces Differential Q-learning and Differential TD-learning algorithms that remove the need for reference functions in average-reward MDPs.
  • The new methods come with convergence guarantees and converge to the actual reward rate and differential value function in continuing (non-episodic) settings.
  • Empirical evaluations on an Access-Control Queuing Task show that the algorithms are more robust to parameter choices than methods that rely on reference states.

An Expert Essay on "Learning and Planning in Average-Reward Markov Decision Processes"

The paper "Learning and Planning in Average-Reward Markov Decision Processes" introduces innovative algorithms for addressing average-reward Markov decision processes (MDPs). This approach to MDPs emphasizes constant, ongoing decision-making, distinct from episodic or discounted approaches. The focus on average-reward MDPs fits well into the theoretical and practical frameworks essential for reinforcement learning (RL) and AI, where optimizing the mean reward per time step is critical. This paper makes notable contributions in proposing algorithms that rectify significant limitations in existing RL methodologies, particularly regarding convergence and applicability in more generalized scenarios.

Key Contributions

The paper lays out three main contributions:

  1. Differential Q-learning: The first off-policy, model-free control algorithm for the average-reward setting that does not depend on reference states. Prior algorithms such as RVI Q-learning depend critically on a reference function, and a poor choice of it can slow convergence substantially. Differential Q-learning circumvents this by maintaining an explicit estimate of the reward rate, updated using the temporal-difference error (see the sketch after this list).
  2. Differential TD-learning: This algorithm estimates both the reward rate and the differential value function without relying on an offset, and it is proven to converge. This is a notable advance, since existing off-policy prediction methods converge only to the value function plus an unknown offset.
  3. Centered Value Functions: The paper also introduces a mechanism for estimating the centered differential value function. Recovering the actual differential values, rather than values shifted by an arbitrary constant, improves interpretability and their usefulness in decision-making contexts.
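
To make the control algorithm concrete, below is a minimal tabular sketch of one Differential Q-learning step in Python. It reflects the paper's central idea that the reward-rate estimate is updated with the temporal-difference error itself and that no reference state or function is needed; the function name, the step-size parameters `alpha` and `eta`, and the array-based interface are conventions of this sketch, not the paper's code.

```python
import numpy as np

def differential_q_learning_step(Q, r_bar, s, a, r, s_next, alpha=0.1, eta=1.0):
    """One tabular Differential Q-learning update (sketch).

    Q      : 2-D array of action-value estimates, indexed Q[state, action]
    r_bar  : current scalar estimate of the average reward (reward rate)
    alpha  : step size for the value update
    eta    : ratio of the reward-rate step size to alpha
    Returns the updated (Q, r_bar).
    """
    # TD error uses the reward-rate estimate in place of discounting.
    delta = r - r_bar + np.max(Q[s_next]) - Q[s, a]
    # Action-value update.
    Q[s, a] += alpha * delta
    # Key difference from earlier average-reward methods: the reward-rate
    # estimate is updated with the TD error, not with (r - r_bar), and no
    # reference state or function is required.
    r_bar += eta * alpha * delta
    return Q, r_bar
```

In a learning loop, `Q` and `r_bar` would be updated after every interaction with the continuing environment, with actions chosen, for example, epsilon-greedily from `Q`.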

Empirical Investigation and Results

The authors evaluate the new algorithms on established benchmark problems. In the control setting, the Access-Control Queuing Task illustrates the advantages of Differential Q-learning over reference-state-based algorithms such as RVI Q-learning: because it does not rely on a reference function, it performs consistently across parameter settings. This robustness suggests practical viability in real-world RL applications where parameter sensitivity can pose significant challenges.
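
For readers unfamiliar with the benchmark, the following is a rough sketch of access-control queuing dynamics in the spirit of the classic formulation in Sutton and Barto's textbook; the specific constants (ten servers, customer priorities 1, 2, 4, 8, and a 0.06 server-release probability) are assumptions of this illustration, not details confirmed from the paper.

```python
import numpy as np

class AccessControlQueue:
    """Rough sketch of an access-control queuing task (illustrative constants).

    A customer with a priority (its reward if served) waits at the head of
    an always-nonempty queue; the agent accepts it (occupying a free server
    and earning the priority as reward) or rejects it (reward 0). Busy
    servers free up independently each step. The task is continuing: there
    are no episode boundaries.
    """

    def __init__(self, num_servers=10, priorities=(1, 2, 4, 8),
                 free_prob=0.06, rng=None):
        self.num_servers = num_servers
        self.priorities = priorities
        self.free_prob = free_prob
        self.rng = rng if rng is not None else np.random.default_rng(0)
        self.free_servers = num_servers
        self.customer = self.rng.choice(self.priorities)

    def state(self):
        # State: (number of free servers, priority of the waiting customer).
        return self.free_servers, self.customer

    def step(self, accept):
        reward = 0
        if accept and self.free_servers > 0:
            reward = int(self.customer)
            self.free_servers -= 1
        # Each busy server finishes its job with probability free_prob.
        busy = self.num_servers - self.free_servers
        self.free_servers += self.rng.binomial(busy, self.free_prob)
        self.customer = self.rng.choice(self.priorities)
        return self.state(), reward
```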

For prediction, the Two Loop environment serves as the testbed for Differential TD-learning. Its convergence, which does not hinge on initialization or on the choice of a reference state, is consistent with the theoretical guarantees, and comparisons with existing techniques highlight its advantages in both formulation and implementation.
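
A corresponding sketch of the prediction update is below; again, the defining choice is that the reward-rate estimate is driven by the TD error. The placement of the importance-sampling ratio `rho` (set to 1 for on-policy evaluation) follows the usual off-policy convention and is an assumption of this sketch rather than a quotation of the paper's pseudocode.

```python
def differential_td_step(V, r_bar, s, r, s_next, rho=1.0, alpha=0.1, eta=1.0):
    """One tabular Differential TD-learning update (sketch).

    V     : 1-D array of state-value estimates
    r_bar : scalar estimate of the target policy's reward rate
    rho   : importance-sampling ratio pi(a|s) / b(a|s); 1.0 when on-policy
    """
    delta = r - r_bar + V[s_next] - V[s]    # TD error with reward-rate term
    V[s] += alpha * rho * delta             # value update
    r_bar += eta * alpha * rho * delta      # reward-rate update uses the TD error
    return V, r_bar
```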

Implementation and Implications

This paper addresses a clear theoretical gap in average-reward reinforcement learning by proposing methods with proven convergence guarantees. These algorithms stand to inform future model-free RL designs, offering a more general approach for environments where episodic boundaries are not available.

In applied contexts, these methods can be valuable in industries such as telecommunications, for queuing problems, and logistics, for optimizing average turnaround times. Furthermore, the simplicity of maintaining an online estimate of the reward rate holds promise for dynamic, real-world applications that require sustained optimization over long operational periods.

Future Directions

Future research can extend these ideas to function approximation, which is crucial for scaling beyond tabular problems. Although Differential Q-learning and Differential TD-learning are analyzed in the tabular setting, establishing similar guarantees under function approximation remains an open and nuanced problem. Additionally, applying these methods in semi-MDP settings could support temporal abstractions such as options, broadening RL's utility in hierarchical task execution.
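
As one illustration of what such an extension might look like, a differential semi-gradient TD(0) step with linear features could be sketched as follows; this mirrors the semi-gradient differential methods discussed in Sutton and Barto's textbook, and whether the paper's convergence results carry over to this setting is precisely the open question noted above.

```python
import numpy as np

def differential_semigradient_td_step(w, r_bar, x, r, x_next, alpha=0.01, eta=1.0):
    """Sketch of a differential semi-gradient TD(0) update with linear features.

    w          : weight vector; the value estimate is v(s) ~= w @ x(s)
    x, x_next  : feature vectors of the current and next state
    r_bar      : scalar estimate of the reward rate
    """
    delta = r - r_bar + w @ x_next - w @ x   # TD error with reward-rate term
    w += alpha * delta * x                   # semi-gradient value update
    r_bar += eta * alpha * delta             # reward-rate estimate updated with the TD error
    return w, r_bar
```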

In summary, "Learning and Planning in Average-Reward Markov Decision Processes" significantly advances the state of the art in average-reward RL. By addressing core limitations with novel algorithms, the paper strengthens both the theoretical foundations and the practical applicability of average-reward reinforcement learning. Future work building on these results could have broad implications across AI, operations research, and automation.
