- The paper introduces MRDR estimators, which choose the model component of the doubly robust (DR) estimator so as to minimize the variance of the overall estimate.
- It establishes theoretical guarantees, including unbiasedness, strong consistency, and asymptotic optimality, in both contextual bandit and full MDP settings.
- Empirical results show a consistent reduction in mean squared error, enhancing off-policy evaluation reliability.
More Robust Doubly Robust Off-policy Evaluation: An Analysis
The research paper titled "More Robust Doubly Robust Off-policy Evaluation" addresses a fundamental problem in reinforcement learning: accurately evaluating a target policy using data gathered under a different policy. This challenge falls under the umbrella of off-policy evaluation (OPE), a critical task in domains such as marketing, healthcare, and robotics, where deploying an unverified policy can have costly or harmful consequences.
The paper focuses on combining the direct method (DM) and importance sampling (IS) within the doubly robust (DR) framework, aiming to mitigate the shortcomings of each approach when used alone. DM can suffer from significant bias when the learned model of the system is inaccurate, while IS often suffers from high variance, especially when the evaluation policy diverges substantially from the behavior policy.
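For concreteness, the standard contextual-bandit forms of these estimators are sketched below. The notation is ours rather than quoted from the paper: logged data (x_i, a_i, r_i) collected under behavior policy mu, evaluation policy pi, and a learned reward model r-hat.

```latex
\hat{V}_{\mathrm{DM}} = \frac{1}{n}\sum_{i=1}^{n}\sum_{a}\pi(a \mid x_i)\,\hat{r}(x_i,a),
\qquad
\hat{V}_{\mathrm{IS}} = \frac{1}{n}\sum_{i=1}^{n}\rho_i\, r_i,
\quad \rho_i = \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)},
```
```latex
\hat{V}_{\mathrm{DR}} = \frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{a}\pi(a \mid x_i)\,\hat{r}(x_i,a)
  + \rho_i\big(r_i - \hat{r}(x_i,a_i)\big)\Big].
```

When the behavior policy is known, the DR estimate is unbiased for any choice of reward model; the model only affects the variance, which is exactly the degree of freedom MRDR exploits.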
Key Contributions
The authors introduce the "more robust doubly robust" (MRDR) estimators, which train the DM component of DR, i.e., the learned reward or value model, to minimize the variance of the entire DR estimator rather than its own prediction error. The approach has roots in existing work on regression with missing data, adapted here to off-policy evaluation.
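As a rough illustration of this variance-minimization idea in the contextual-bandit case, the sketch below fits a linear reward model by minimizing the empirical variance of the per-sample DR terms. This is a simplified stand-in for the paper's derivation, not its exact formulation; all function and variable names (dr_terms, mrdr_loss, fit_mrdr, phi, and so on) are illustrative assumptions of ours.

```python
# Hypothetical sketch: fit a linear reward model by minimizing the empirical
# variance of the per-sample DR terms, the core idea behind MRDR for bandits.
import numpy as np
from scipy.optimize import minimize

def dr_terms(theta, X, actions, rewards, rho, pi_eval, phi):
    """Per-sample DR estimates under a linear reward model r_hat(x, a) = phi(x, a) @ theta.

    X: contexts (n, d); actions: logged actions (n,); rewards: logged rewards (n,)
    rho: importance weights pi(a_i|x_i)/mu(a_i|x_i), shape (n,)
    pi_eval: evaluation-policy probabilities, shape (n, num_actions)
    phi: feature map phi(x, a) -> vector of length d_feat
    """
    n, num_actions = pi_eval.shape
    terms = np.empty(n)
    for i in range(n):
        # DM part: expected model reward under the evaluation policy
        dm = sum(pi_eval[i, a] * (phi(X[i], a) @ theta) for a in range(num_actions))
        # IS correction on the logged action
        correction = rho[i] * (rewards[i] - phi(X[i], actions[i]) @ theta)
        terms[i] = dm + correction
    return terms

def mrdr_loss(theta, *data):
    # MRDR idea: pick the reward model that minimizes the variance of the DR
    # estimator; its mean stays unbiased for any theta when mu is known.
    return np.var(dr_terms(theta, *data))

def fit_mrdr(X, actions, rewards, rho, pi_eval, phi, d_feat):
    res = minimize(mrdr_loss, x0=np.zeros(d_feat),
                   args=(X, actions, rewards, rho, pi_eval, phi))
    theta = res.x
    return theta, dr_terms(theta, X, actions, rewards, rho, pi_eval, phi).mean()
```

The paper derives this objective in closed form; the sketch simply optimizes the sample variance directly to convey the idea.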
- Variance Minimization: The paper formulates an optimization problem whose objective is the variance of the DR estimator. This requires deriving that variance as a function of the model parameters, in both the contextual bandit setting and the sequential reinforcement learning setting. The resulting MRDR estimators are shown to be strongly consistent and asymptotically optimal.
- Theoretical Guarantees: Through rigorous derivations, the analysis covers both the contextual bandit setting (a single-step MDP) and the full MDP setting (sequential decision making over a horizon), showing that MRDR lowers variance without introducing bias when the behavior policy is known (a sketch of the sequential DR recursion appears after this list). This fills a gap left by prior work, which primarily improved the IS component without systematically optimizing the DM component.
- Empirical Evaluation: Experiments on standard contextual bandit and reinforcement learning benchmarks demonstrate the superior performance of MRDR compared to existing estimators. The results show reduced mean squared error (MSE) across scenarios, validating the effectiveness of MRDR.
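For the sequential case referenced above, here is a minimal sketch of the step-wise DR recursion that the MDP analysis builds on. Names are our own; it assumes a fitted action-value model q_hat for the evaluation policy and its induced state value v_hat.

```python
# Hypothetical sketch of the step-wise DR estimate for one trajectory; MRDR
# fits q_hat so that the variance of this estimator (over trajectories) is minimized.
def dr_trajectory_estimate(trajectory, rho, q_hat, v_hat, gamma=0.99):
    """trajectory: list of (state, action, reward) tuples for t = 0..H-1
    rho: per-step importance weights pi(a_t|s_t)/mu(a_t|s_t)
    q_hat(s, a), v_hat(s): model estimates of the evaluation policy's values
    """
    estimate = 0.0
    # Backward recursion: V_DR^t = v_hat(s_t) + rho_t * (r_t + gamma * V_DR^{t+1} - q_hat(s_t, a_t))
    for (s, a, r), w in zip(reversed(trajectory), reversed(rho)):
        estimate = v_hat(s) + w * (r + gamma * estimate - q_hat(s, a))
    return estimate  # average over trajectories to estimate the policy value
```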
Implications
The MRDR estimators offer practical benefits by improving the reliability of policy evaluation in real-world applications. The accompanying theory means that, in settings where deploying untested policies is risky, MRDR provides well-grounded offline estimates of how a policy would perform before it is ever deployed.
Future Directions
Future research paths may explore:
- Extending MRDR to handle multiple behavior policies or unknown behavior policies, which expands the estimator's applicability to scenarios with incomplete data provenance.
- Investigating high-dimensional action spaces, as in combinatorial bandits, which could pose challenges for scaling MRDR's optimization approach.
- Integrating with modern reinforcement learning algorithms to dynamically learn behavior policies, potentially enhancing real-time decision-making systems.
In summary, this paper contributes a sophisticated methodology for advancing off-policy evaluation by leveraging a more integrated and theoretically grounded approach to the combination of DM and IS, bringing us closer to safer and more reliable deployment of AI systems in critical applications.