
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback (1502.02362v2)

Published 9 Feb 2015 in cs.LG and stat.ML

Abstract: We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method -- called Policy Optimizer for Exponential Models (POEM) -- for learning stochastic linear rules for structured output prediction. We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. POEM is evaluated on several multi-label classification problems showing substantially improved robustness and generalization performance compared to the state-of-the-art.

Citations (167)

Summary

  • The paper introduces the Counterfactual Risk Minimization (CRM) principle for robust batch learning from logged bandit feedback, addressing limitations of traditional methods by incorporating propensity scoring and variance-aware generalization bounds.
  • From the CRM framework, the authors derive the Policy Optimizer for Exponential Models (POEM), a new algorithm designed for structured output prediction using stochastic gradient optimization.
  • Experimental evaluations show POEM achieves significant gains in robustness, generalization performance, and computational efficiency over state-of-the-art methods, especially in multi-label classification with large prediction spaces.

Overview of Counterfactual Risk Minimization: Learning from Logged Bandit Feedback

Swaminathan and Joachims present a significant contribution to machine learning with their Counterfactual Risk Minimization (CRM) principle, designed specifically to tackle the challenges of batch learning from logged bandit feedback. CRM addresses key limitations of traditional empirical risk minimization by incorporating propensity scoring and variance-aware generalization error bounds. The paper not only lays the theoretical groundwork for counterfactual learning but also derives a new algorithm, the Policy Optimizer for Exponential Models (POEM), for structured output prediction.

Learning from Logged Bandit Feedback

The paper begins by recognizing the prevalent scenario in which online systems generate logs of bandit feedback: an inherently partial and biased form of supervision that does not cover the full space of possible system responses. Such logs record feedback only for the predictions that were actually made, never for the counterfactual alternatives, making batch learning from them fundamentally different from supervised learning. Traditional approaches to exploiting logged bandit feedback typically try to approximate a supervised learning setup, which often results in poor generalization.
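The propensity-scoring idea at the heart of this setting can be illustrated with a minimal sketch of an inverse-propensity-scored (IPS) risk estimator. The function name and toy numbers below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def ips_risk_estimate(losses, logging_probs, new_probs):
    """Inverse-propensity-scored (IPS) estimate of a new policy's risk.

    losses        : loss observed for each logged (input, prediction) pair
    logging_probs : propensity p0(y|x) of the logged prediction under the
                    logging policy
    new_probs     : probability p(y|x) of that same prediction under the
                    policy being evaluated
    """
    weights = new_probs / logging_probs  # importance weights
    # Unbiased for the new policy's risk whenever p0(y|x) > 0 on its support.
    return np.mean(weights * losses)

# Toy example: three logged interactions.
losses = np.array([1.0, 0.0, 1.0])
p0 = np.array([0.5, 0.25, 0.5])
p_new = np.array([0.25, 0.5, 0.25])
est = ips_risk_estimate(losses, p0, p_new)
```

Reweighting by `p_new / p0` corrects for the logging policy's bias, but the estimate can have high variance when the two policies diverge, which is precisely the issue CRM addresses.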

The Counterfactual Risk Minimization Principle

To overcome these challenges, Swaminathan and Joachims propose the CRM principle. The core of CRM lies in its ability to estimate counterfactual risk — a measure of how different models would have performed on the available data — by using propensity scoring to correct for the biases inherent in bandit feedback logs. The approach goes beyond mere unbiased risk estimation: it explicitly accounts for the variance of the risk estimator across the hypothesis space.

The CRM principle is backed by a generalization error bound that incorporates variance terms using empirical Bernstein arguments. This constructive bound provides a rigorous basis for defining a conservative confidence bound during hypothesis selection, ensuring robust learning from the uncertain estimations typical of bandit feedback data.
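The resulting selection rule can be sketched as minimizing the IPS risk estimate plus a penalty on its standard error, in the spirit of the empirical Bernstein bound: choose the hypothesis minimizing R̂(h) + λ√(Var̂(h)/n). The policy values and the `lam` default below are hypothetical:

```python
import numpy as np

def crm_objective(losses, logging_probs, new_probs, lam=1.0):
    """CRM-style variance-regularized objective for scoring a policy.

    Scores a policy by its IPS risk estimate plus a penalty on the
    estimated standard error of that estimate:
        R_hat(h) + lam * sqrt(Var_hat(h) / n)
    """
    n = len(losses)
    w = (new_probs / logging_probs) * losses  # per-sample weighted losses
    risk = w.mean()                           # IPS risk estimate
    var = w.var(ddof=1)                       # sample variance of the terms
    return risk + lam * np.sqrt(var / n)

# Two hypothetical policies with identical IPS risk but different variance.
losses = np.array([1.0, 0.0, 1.0, 0.0])
p0 = np.full(4, 0.5)
steady = np.full(4, 0.5)                 # matches the logging policy
risky = np.array([0.9, 0.1, 0.1, 0.9])  # concentrates importance weight
obj_steady = crm_objective(losses, p0, steady)
obj_risky = crm_objective(losses, p0, risky)
```

Both policies here have the same estimated risk of 0.5, yet the high-variance one scores worse under the CRM objective — the conservative, variance-aware behavior the bound is designed to induce.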

Policy Optimizer for Exponential Models (POEM)

From the CRM framework, the authors derive POEM, an algorithm for structured output prediction that learns stochastic linear rules. POEM optimizes a learning objective obtained through a decomposition of the variance term, enabling efficient stochastic gradient optimization. Experimental validation across multiple datasets demonstrates markedly improved robustness and generalization over state-of-the-art methods.
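A POEM-style learner can be sketched as stochastic gradient descent on the importance-weighted loss of a conditional exponential model, π_w(y|x) ∝ exp(w · φ(x, y)). The sketch below omits the variance penalty and its decomposition for brevity; the function names, feature values, and learning rate are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def sgd_step(w, features, y_idx, loss, prop, lr=0.1):
    """One stochastic gradient step on the importance-weighted loss for a
    conditional exponential model  pi_w(y|x) ∝ exp(w · phi(x, y)).

    features : rows are joint feature vectors phi(x, y), one per candidate y
    y_idx    : index of the logged prediction for this input
    prop     : logging propensity p0(y|x) of that prediction
    """
    probs = softmax(features @ w)
    ratio = probs[y_idx] / prop                     # importance weight pi_w / p0
    grad_logp = features[y_idx] - probs @ features  # grad of log pi_w(y|x)
    # grad_w [ratio * loss] = ratio * loss * grad_logp
    return w - lr * ratio * loss * grad_logp

# Toy step: three candidate outputs, 2-d joint features, uniform logging policy.
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w0 = np.zeros(2)
w1 = sgd_step(w0, features, y_idx=0, loss=1.0, prop=1.0 / 3.0)
```

After the step, a positive observed loss pushes probability mass away from the logged prediction, which is the qualitative behavior one expects from minimizing importance-weighted risk.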

Experimental Validation

The empirical evaluations on multi-label classification problems reveal POEM's significant generalization gains, outperforming current best practices especially in scenarios with expansive prediction spaces. Furthermore, the stochastic optimization variant of POEM exhibits superior computational efficiency compared to batch approaches, indicating its suitability for large-scale applications.

Implications and Future Directions

The implications of CRM extend beyond the immediate batch learning tasks. The principle's variance-aware approach could inform algorithm design across domains including reinforcement learning and policy evaluation in bandit settings. Future research could explore CRM's adaptability to other feedback types, such as ordinal or co-active feedback, and its applicability in dynamic environments where the logging policy varies over time.

In conclusion, the proposed CRM principle and the derived POEM algorithm represent a fundamental step forward in learning from logged bandit feedback, offering principled solutions to its counterfactual nature. By bridging the gap between propensity-based risk estimation and structured learning, this work lays the groundwork for robust batch learning in complex environments.