- The paper introduces a public bandit dataset from ZOZOTOWN, enabling realistic off-policy evaluation experiments.
- It presents an open-source Python pipeline that standardizes data preprocessing, policy learning, and evaluation for transparent comparisons.
- Comprehensive benchmarks show that estimator accuracy varies substantially across settings and hyperparameter choices, underscoring the need for careful estimator selection.
Open Bandit Dataset and Pipeline: A Critical Enabler for Off-Policy Evaluation Research
The paper "Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation" addresses a significant gap in the off-policy evaluation (OPE) research community by providing a real-world logged bandit dataset and associated software tools. This contribution is aimed at making OPE research more practical and reproducible, leveraging data from ZOZOTOWN, Japan's largest fashion e-commerce platform.
Context and Challenges
OPE estimates the performance of a counterfactual policy using data logged under a different behavior policy. This capability is crucial for applications ranging from recommendation systems to healthcare. Despite theoretical advances in OPE, experimental validation has lagged due to a scarcity of realistic datasets. Prior studies often relied on synthetic simulations or proprietary datasets, which lacked the diversity and representational fidelity needed for generalizable results.
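To make the estimation target concrete: writing $\pi_b$ for the logging (behavior) policy and $\pi_e$ for the counterfactual policy under evaluation, OPE seeks the policy value $V(\pi_e)$ using only the tuples $(x_i, a_i, r_i)$ that $\pi_b$ logged. The classical importance-weighting identity and its empirical (IPW) estimator below are standard OPE background, not notation specific to this paper:

$$
V(\pi_e) = \mathbb{E}_{x}\,\mathbb{E}_{a \sim \pi_b(\cdot \mid x)}\!\left[\frac{\pi_e(a \mid x)}{\pi_b(a \mid x)}\, r\right],
\qquad
\hat{V}_{\mathrm{IPW}}(\pi_e) = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi_e(a_i \mid x_i)}{\pi_b(a_i \mid x_i)}\, r_i.
$$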
Contributions of the Paper
Open Bandit Dataset: The dataset comprises logged bandit data from ZOZOTOWN, collected using two distinct policies: Bernoulli Thompson Sampling and a uniform random policy. This two-policy design is what makes realistic experimental comparisons of OPE methods possible: because each policy's true value can be estimated on-policy from its own logs, researchers can evaluate an OPE estimator by predicting one policy's value from the other policy's data and checking the prediction against this ground truth. The dataset's public availability enables reproducibility, a cornerstone for validating and extending OPE research.
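A minimal sketch of loading the dataset with the authors' `obp` package, following its documented quickstart (argument names and defaults may differ across `obp` versions):

```python
from obp.dataset import OpenBanditDataset

# Logs collected by the uniform random policy on the "all" campaign.
# obp ships a small sample of the data; the full dataset is a separate download.
dataset = OpenBanditDataset(behavior_policy="random", campaign="all")

# Dictionary of logged feedback: contexts, chosen actions, observed rewards,
# slot positions, and the behavior policy's action-choice probabilities (pscore).
bandit_feedback = dataset.obtain_batch_bandit_feedback()
print(bandit_feedback["n_rounds"], bandit_feedback["n_actions"])
```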
Open Bandit Pipeline: The authors introduce an open-source Python package that streamlines experimentation with OPE estimators and batch (offline) bandit algorithms. The pipeline provides standardized modules for dataset preprocessing, policy learning, and OPE, offering a unified interface that facilitates fair and transparent comparisons among approaches.
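The pipeline's end-to-end evaluation flow looks roughly like the sketch below, adapted from `obp`'s quickstart (exact signatures may vary by version): estimate the value of Bernoulli Thompson Sampling using logs from the random policy, comparing several estimators in one call.

```python
# Sketch based on obp's documented quickstart; API details may vary by version.
from sklearn.linear_model import LogisticRegression
from obp.dataset import OpenBanditDataset
from obp.policy import BernoulliTS
from obp.ope import (
    OffPolicyEvaluation,
    RegressionModel,
    DirectMethod,
    InverseProbabilityWeighting,
    DoublyRobust,
)

dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# Counterfactual policy: Bernoulli Thompson Sampling with the priors
# used in ZOZOTOWN's production system.
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True,
    campaign="all",
    random_state=12345,
)
action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)

# Reward model required by the model-based estimators (DM, DR).
regression_model = RegressionModel(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    base_model=LogisticRegression(max_iter=1000),
)
estimated_rewards = regression_model.fit_predict(
    context=bandit_feedback["context"],
    action=bandit_feedback["action"],
    reward=bandit_feedback["reward"],
    position=bandit_feedback["position"],
)

# One call compares all requested estimators on the same logged feedback.
ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,
    ope_estimators=[DirectMethod(), InverseProbabilityWeighting(), DoublyRobust()],
)
print(ope.estimate_policy_values(
    action_dist=action_dist,
    estimated_rewards_by_reg_model=estimated_rewards,
))
```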
Experimental Evaluation
The paper conducts comprehensive benchmark experiments using the Open Bandit Dataset and Pipeline. It critically evaluates several OPE estimators, including the Direct Method (DM), Inverse Probability Weighting (IPW), and Doubly Robust (DR), among others. A notable finding is that no single estimator dominates: relative accuracy varies across settings and hyperparameter configurations. The choice of estimator and its tuning is therefore crucial; for instance, Doubly Robust with Optimistic Shrinkage (DRos) performs well when its shrinkage hyperparameter is tuned appropriately, underscoring the need to select and configure estimators for the application context.
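For intuition about what these estimators compute, here is a self-contained NumPy sketch written from their standard definitions (not from the `obp` source). Here `w` is the importance weight $\pi_e(a \mid x) / \pi_b(a \mid x)$, `q_hat` is a fitted reward model's prediction for the logged action, and `q_hat_pi_e` is the model's expected reward under the evaluation policy:

```python
import numpy as np

def dm(q_hat_pi_e):
    """Direct Method: average the reward model's prediction under pi_e."""
    return np.mean(q_hat_pi_e)

def ipw(w, r):
    """Inverse Probability Weighting: reweight observed rewards by w."""
    return np.mean(w * r)

def dr(w, r, q_hat, q_hat_pi_e):
    """Doubly Robust: DM baseline plus an importance-weighted correction."""
    return np.mean(q_hat_pi_e + w * (r - q_hat))

def dr_os(w, r, q_hat, q_hat_pi_e, lam):
    """DR with optimistic shrinkage (Su et al., 2020): shrink large weights
    via w_lam = lam * w / (w**2 + lam) before the DR correction; lam is the
    hyperparameter whose tuning the benchmarks highlight."""
    w_lam = lam * w / (w ** 2 + lam)
    return np.mean(q_hat_pi_e + w_lam * (r - q_hat))
```

Note how DRos interpolates between DR (large `lam`) and DM (`lam` near zero), which is precisely why tuning `lam` matters so much in the benchmark results.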
Implications and Future Directions
The availability of real-world logged bandit data is a significant step forward for OPE research, allowing estimators to be validated against real ground truth rather than synthetic proxies. Future work could extend data collection to other domains, broadening the range of applications the benchmarks cover. Moreover, the variability observed in the experiments highlights a pressing need for methods that automate hyperparameter tuning and estimator selection.
As the first public dataset of its kind, the Open Bandit Dataset sets an important precedent for the community, encouraging open-science practices in machine learning research and promising a pathway to more robust, transparent, and applicable AI systems across diverse real-world scenarios.