- The paper introduces a doubly robust (DR) estimator that combines the Direct Method (DM) and Inverse Propensity Scoring (IPS) to reduce bias and variance in contextual bandit evaluation.
- It provides a theoretical analysis showing that the estimator is unbiased when at least one of the two models (rewards or past action probabilities) is correctly specified, and that it typically achieves lower variance than IPS when the reward model is reasonably accurate.
- Empirical results on benchmark and real-world datasets confirm that the DR estimator consistently outperforms DM and IPS in policy evaluation and yields better policies in policy optimization.
Doubly Robust Policy Evaluation and Learning
The paper "Doubly Robust Policy Evaluation and Learning" by Dudik, Langford, and Li addresses the challenge of policy evaluation and optimization in environments where rewards are only partially observed—a scenario often encountered in contextual bandit settings. The contextual bandit problem is prevalent in various domains, including healthcare policy decision-making and internet advertising. A significant hurdle in these applications is the evaluation of new policies using historical data that may have been collected with different action selection strategies than those proposed by the new policy.
Contextual Bandit Evaluation Approaches
Historically, two primary evaluation strategies have been used: the Direct Method (DM) and Inverse Propensity Scoring (IPS). DM estimates the reward function from the logged data and scores the new policy against that model; it is biased whenever the model fails to capture the true reward dynamics. IPS instead corrects for the mismatch between the new and historical action distributions by importance weighting. IPS is unbiased as long as the historical action probabilities are known or well estimated, but it suffers from high variance, particularly when the new policy frequently chooses actions that were rare under the historical policy.
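As a rough illustration (not the paper's code), the Python sketch below computes DM and IPS estimates from logged bandit data; the array names and the `reward_model`/`target_policy` interfaces are hypothetical placeholders introduced here.

```python
import numpy as np

def dm_estimate(contexts, target_policy, reward_model):
    """Direct Method: average the modeled reward of the target policy's actions."""
    chosen = target_policy(contexts)                    # actions the new policy would take
    return np.mean(reward_model(contexts, chosen))      # biased if the reward model is wrong

def ips_estimate(contexts, logged_actions, logged_rewards,
                 logged_propensities, target_policy):
    """Inverse Propensity Scoring: reweight logged rewards by 1/propensity
    on the examples where the new policy agrees with the logged action."""
    chosen = target_policy(contexts)
    match = (chosen == logged_actions).astype(float)    # indicator of agreement
    return np.mean(match * logged_rewards / logged_propensities)  # unbiased, high variance
```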
Introduction of the Doubly Robust Estimator
The authors combine the two approaches through doubly robust (DR) estimation. The DR estimator leverages the strengths of both DM and IPS: it gives an accurate evaluation whenever either the reward model or the model of past action probabilities is correct. Being unbiased if at least one of the two models is accurate, it addresses the bias of DM and the variance of IPS simultaneously.
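Continuing the sketch above with the same hypothetical interfaces, a doubly robust estimate starts from the reward model's prediction for the target policy's action and adds an importance-weighted correction of the model's residual whenever the logged action matches that choice:

```python
def dr_estimate(contexts, logged_actions, logged_rewards, logged_propensities,
                target_policy, reward_model):
    """Doubly Robust: DM prediction plus an IPS-weighted correction of the model's residual."""
    chosen = target_policy(contexts)
    dm_part = reward_model(contexts, chosen)                  # model-based baseline
    match = (chosen == logged_actions).astype(float)
    residual = logged_rewards - reward_model(contexts, logged_actions)
    correction = match * residual / logged_propensities       # zero-mean if propensities are right
    return np.mean(dm_part + correction)
```

If the reward model is exact, the residual vanishes and DR reduces to DM; if the propensities are exact, the correction term has zero mean and removes the reward model's bias.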
The theoretical underpinning of the DR approach rests on a detailed bias and variance analysis. The paper shows that the DR estimator's bias is governed by the product of the errors in the reward and action-probability estimates, so its expectation matches the true policy value whenever either error is small. Moreover, the variance of the DR estimator is shown to be lower than that of IPS in practical regimes where the reward model achieves even moderate accuracy.
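Schematically, in notation introduced here rather than taken verbatim from the paper, if the reward model $\hat{\rho}$ has additive error $\Delta$ and the estimated propensities $\hat{p}$ have relative error $\delta$, the two errors enter the DR estimate only as a product, which is why the bias vanishes when either model is correct:

$$
\hat{V}_{\mathrm{DR}}(\pi) - V(\pi) \;\approx\; \mathbb{E}\Big[\Delta\big(x,\pi(x)\big)\,\delta\big(x,\pi(x)\big)\Big],
\qquad \Delta = \hat{\rho} - \rho, \quad \delta = 1 - \frac{p}{\hat{p}}.
$$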
Empirical Validation
The paper presents empirical results on public multiclass classification datasets transformed into contextual bandit problems by revealing the reward of only a single, randomly selected action per example. The results indicate that DR consistently outperforms both DM and IPS in estimating policy values. For policy optimization, DR yields a substantial improvement over IPS and is competitive with more specialized methods such as the Offset Tree.
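For readers who want to reproduce this style of benchmark, the following is a minimal sketch of the standard supervised-to-bandit conversion, assumed here rather than copied from the paper's exact protocol: each example's true label defines which action earns reward 1, but only one uniformly random action and its reward are logged.

```python
import numpy as np

def classification_to_bandit(features, labels, n_actions, rng=None):
    """Turn a multiclass dataset into partially observed bandit feedback:
    for each example, log one uniformly random action and reveal only its reward
    (1 if it equals the true label, 0 otherwise)."""
    rng = rng or np.random.default_rng(0)
    logged_actions = rng.integers(n_actions, size=len(labels))
    logged_rewards = (logged_actions == labels).astype(float)
    logged_propensities = np.full(len(labels), 1.0 / n_actions)  # uniform logging policy
    return features, logged_actions, logged_rewards, logged_propensities
```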
In an applied setting, a real-world dataset of user visits to an internet portal is used to further validate the DR estimator. Here too, DR is shown to reduce variance and improve RMSE relative to IPS, especially in small-sample regimes.
Implications and Future Directions
This work advances policy evaluation methodology for contextual bandits by providing more robust estimation techniques, with the potential to improve policy decision-making in numerous applications. The DR method's compatibility with different learning algorithms underscores its broad applicability in practice.
Looking forward, combining DR techniques with further bias and variance reduction strategies could yield even better policies in dynamic and uncertain environments. Moreover, designing algorithms that build the doubly robust principle directly into their objectives could provide more reliable tools for AI systems that learn from batches of historical data.
The methodology and findings of this paper pave the way for further exploration into more sophisticated policy evaluation and optimization frameworks, motivating future theoretical and empirical investigations in AI and machine learning domains.