Minimax Weight and Q-Function Learning for Off-Policy Evaluation (1910.12809v4)

Published 28 Oct 2019 in cs.LG and stat.ML

Abstract: We provide theoretical investigations into off-policy evaluation in reinforcement learning using function approximators for (marginalized) importance weights and value functions. Our contributions include: (1) A new estimator, MWL, that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work (Liu et al., 2018). (2) Another new estimator, MQL, obtained by swapping the roles of importance weights and value-functions in MWL. MQL has an intuitive interpretation of minimizing average Bellman errors and can be combined with MWL in a doubly robust manner. (3) Several additional results that offer further insights into these methods, including the sample complexity analyses of MWL and MQL, their asymptotic optimality in the tabular setting, how the learned importance weights depend on the choice of the discriminator class, and how our methods provide a unified view of some old and new algorithms in RL.

Citations (170)

Summary

  • The paper introduces Minimax Weight Learning (MWL) and Minimax Q-Function Learning (MQL) as novel algorithms for improved off-policy evaluation with lower variance in reinforcement learning.
  • MWL directly learns state-action importance ratios, eliminating the need for explicit knowledge of the behavior policy and mitigating variance growth over time horizons.
  • MQL learns Q-functions by minimizing average Bellman errors, and experiments show that both MWL and MQL substantially improve off-policy evaluation accuracy.

Overview of Minimax Weight and Q-Function Learning for Off-Policy Evaluation

This paper investigates off-policy evaluation (OPE) in reinforcement learning (RL): estimating the performance of a target policy using historical data collected under a different behavior policy. The authors introduce two algorithms, Minimax Weight Learning (MWL) and Minimax Q-Function Learning (MQL), which estimate marginalized importance weights and value functions with function approximation and without requiring explicit knowledge of the behavior policy. This marks a shift from step-wise and trajectory-wise importance sampling approaches, which suffer from the "curse of horizon": variance that grows exponentially with the length of the decision horizon.
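To make the "curse of horizon" concrete, here is a brief sketch in standard notation (which may differ from the paper's): trajectory-wise importance sampling reweights an entire trajectory by a product of per-step policy ratios, whereas the marginalized approach pursued here reweights individual transitions by a single state-action density ratio,

$$\rho_{0:T} \;=\; \prod_{t=0}^{T} \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)} \qquad \text{versus} \qquad w_{\pi/\mu}(s,a) \;=\; \frac{d^{\pi}(s,a)}{d^{\mu}(s,a)},$$

where $\mu$ is the behavior policy, $d^{\pi}$ is the (discounted) state-action occupancy of the target policy, and $d^{\mu}$ is the data distribution. The variance of the product $\rho_{0:T}$ can grow exponentially in the horizon $T$, while $w_{\pi/\mu}$ is a per-transition quantity that does not depend on trajectory length.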

Key Contributions

  1. MWL Estimator: This estimator directly learns importance ratios over state-action distributions, thereby eliminating the requirement for explicit knowledge of the behavior policy. By modeling the relationship between the state-action distribution induced by the target policy and the one from which the data were collected, MWL avoids the exponential variance growth traditionally associated with trajectory-wise importance sampling (see the sketch of its objective after this list).
  2. MQL Estimator: The MQL estimator swaps the roles of importance weights and value functions in MWL. By using importance weights as discriminators while learning Q-functions, MQL minimizes average Bellman errors (also sketched below). The approach is symmetric to MWL, and the two can be combined in a doubly robust manner.
  3. Statistical Insights: The paper provides sample complexity analyses and asymptotic optimality results for MWL and MQL. In particular, it shows that these estimators achieve the semiparametric lower bound of OPE in the tabular setting, a benchmark previously unmet by many OPE methods.
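The first two contributions can be summarized by their minimax objectives. The following is a hedged sketch in common notation (normalization conventions may differ from the paper's): MWL searches for a weight function $w \in \mathcal{W}$ whose worst-case violation, over discriminators $f \in \mathcal{F}$, of the occupancy-matching condition characterizing $w_{\pi/\mu}$ is smallest; MQL searches for a $q \in \mathcal{Q}$ whose worst-case average Bellman error, over discriminators $g \in \mathcal{G}$, is smallest:

$$L_{\mathrm{MWL}}(w, f) = \mathbb{E}_{(s,a,s') \sim d^{\mu}}\!\left[\gamma\, w(s,a)\, f(s',\pi) - w(s,a)\, f(s,a)\right] + (1-\gamma)\, \mathbb{E}_{s_0 \sim d_0}\!\left[f(s_0,\pi)\right], \qquad \hat{w} = \arg\min_{w \in \mathcal{W}} \max_{f \in \mathcal{F}} L_{\mathrm{MWL}}(w,f)^2,$$

$$L_{\mathrm{MQL}}(q, g) = \mathbb{E}_{(s,a,r,s') \sim d^{\mu}}\!\left[g(s,a)\left(r + \gamma\, q(s',\pi) - q(s,a)\right)\right], \qquad \hat{q} = \arg\min_{q \in \mathcal{Q}} \max_{g \in \mathcal{G}} L_{\mathrm{MQL}}(q,g)^2,$$

where $f(s,\pi) := \mathbb{E}_{a \sim \pi(\cdot \mid s)}[f(s,a)]$ and $d_0$ is the initial-state distribution. The value of $\pi$ is then estimated as $\mathbb{E}_{d^{\mu}}[\hat{w}(s,a)\, r]$ from the weights, or as $(1-\gamma)\, \mathbb{E}_{s_0 \sim d_0}[\hat{q}(s_0,\pi)]$ from the Q-function (up to the paper's normalization of returns), and the two estimates can be combined in a doubly robust fashion.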

Theoretical and Practical Implications

The introduction of MWL and MQL offers a new perspective on off-policy evaluation by combining function approximation with discriminator-based minimax formulations. Such approaches allow a broader class of problems to be evaluated effectively, including settings where high variance or biased estimates previously made evaluation difficult. The reduced dependence on behavior-policy knowledge has potentially significant implications for practical applications in robotics, autonomous systems, and other domains where off-policy learning is critical.

Speculation on Future Developments

The proposed techniques suggest promising directions for advancing RL methods. In particular, integrating more expressive discriminator classes, such as those based on neural networks or kernel methods, may further improve the robustness and efficiency of OPE; one way a kernel-based discriminator class could be used is sketched below. As function approximation techniques evolve, these OPE methods could be extended to richer, more complex environments, potentially enhancing the capabilities of autonomous decision-making systems.
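As an illustration of the kernel direction, below is a minimal sketch, not the paper's reference implementation, of an MQL-style loss when the discriminator class is taken to be the unit ball of an RBF-kernel RKHS. In that case the inner maximization has a closed form (a standard RKHS identity): the squared worst-case average Bellman error becomes a kernel-weighted product of Bellman residuals. The network architecture, function names, and hyperparameters below are illustrative assumptions.

```python
# Hedged sketch: MQL-style loss with an RBF-kernel discriminator class
# (unit RKHS ball). The inner max over discriminators is evaluated in
# closed form as a kernel-weighted product of Bellman residuals.
# All names, sizes, and hyperparameters are illustrative.
import torch
import torch.nn as nn


class QNet(nn.Module):
    """Simple Q-network for a discrete-action MDP."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)  # shape (batch, num_actions): Q(s, .)


def mql_kernel_loss(q: QNet, s, a, r, s_next, pi_probs_next,
                    gamma: float = 0.99, bandwidth: float = 1.0) -> torch.Tensor:
    """Closed-form max over the unit ball of an RBF-kernel RKHS.

    s, s_next: (N, state_dim) floats; a: (N,) long; r: (N,) float;
    pi_probs_next: (N, num_actions) target-policy probabilities at s_next.
    """
    q_sa = q(s).gather(1, a.unsqueeze(1)).squeeze(1)            # Q(s, a)
    v_next = (pi_probs_next * q(s_next)).sum(dim=1)             # E_{a'~pi}[Q(s', a')]
    delta = r + gamma * v_next - q_sa                           # Bellman residual

    # Kernel over (s, a) pairs; appending the action index is a crude but
    # simple choice for illustration (a one-hot encoding would also work).
    x = torch.cat([s, a.unsqueeze(1).float()], dim=1)
    k = torch.exp(-torch.cdist(x, x) ** 2 / (2 * bandwidth ** 2))

    # V-statistic estimate of E[delta_i * delta_j * K(x_i, x_j)].
    return (delta.unsqueeze(0) * delta.unsqueeze(1) * k).mean()
```

Minimizing this loss over the Q-network's parameters with any standard optimizer yields an approximate $\hat{q}$; the off-policy value estimate is then $(1-\gamma)\,\mathbb{E}_{s_0 \sim d_0,\, a_0 \sim \pi}[\hat{q}(s_0, a_0)]$, up to normalization of the return.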

Experimental Validation

In various experimental settings, MWL and MQL demonstrate substantial improvements over existing OPE methods, achieving lower mean squared errors and more accurate value estimates under different configurations of behavior data and target policies. These findings back the theoretical claims with empirical evidence, reinforcing the viability of the proposed estimators in practice.

In conclusion, this work contributes significantly to the landscape of off-policy evaluation in reinforcement learning, offering estimators that broaden the scope for scalable, efficient policy evaluation in complex decision-making environments. Future research may extend these approaches, integrate them into larger frameworks, or explore their use in improving the performance of general RL algorithms.