Offline Imitation Learning in Contextual Bandits
- The paper introduces a PIL-IML framework that leverages a surrogate objective to approximate the reward-maximizing policy while controlling high importance weight variance.
- It employs reward-weighted cross-entropy when logging probabilities are missing, ensuring robust policy learning despite incomplete propensity scores and confounding variables.
- Empirical evaluations on simulated and real-world datasets demonstrate improved performance over traditional IPWE methods, supporting safe deployment and effective diagnostic assessment.
Offline imitation learning in contextual bandits is the study and practice of inferring a reward-maximizing policy from static, logged data generated by a historical decision-making system operating under the contextual bandit paradigm. The central challenge is how to use data composed of context–action–reward tuples (often with randomized action selection and possibly missing logging probabilities) to construct policies that either safely imitate or improve upon the historical behavior, while avoiding the pitfalls of confounding, distribution shift, and high-variance estimation.
1. Problem Formulation and Contextual Bandit Model
The offline contextual bandit model considered in (Ma et al., 2019) collects data consisting of tuples $(x_i, a_i, r_i)$, where $x_i$ denotes the context, $a_i$ the action chosen by a logging policy $\mu$, and $r_i$ the observed (non-negative) reward. Notably, action selection is often randomized by $\mu$, and rewards are observed only for the actions actually taken. Furthermore, unobserved confounders $u_i$ that influence both the action and the reward may be present, resulting in the data-generating process:
- $(x_i, u_i) \sim p(x, u)$,
- $a_i \sim \mu(a \mid x_i, u_i)$ (the action distribution possibly context- and confounder-dependent),
- $r_i \sim p(r \mid x_i, u_i, a_i)$.
The learning objective is to construct a new policy $\pi$ that, if deployed, would maximize the expected reward:
$$V(\pi) \;=\; \mathbb{E}_{(x,u)}\,\mathbb{E}_{a \sim \pi(\cdot \mid x)}\big[\,\mathbb{E}[r \mid x, u, a]\,\big],$$
despite having only empirical feedback for the actions actually taken under the logging policy, rather than for all possible actions in each context.
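The following minimal sketch simulates this data-generating process with a binary hidden confounder; the particular distributions, dimensions, and the `simulate_logs` helper are illustrative assumptions for exposition, not taken from the paper.

```python
import numpy as np

def simulate_logs(n=10_000, n_actions=3, dim=5, seed=0):
    """Simulate logged bandit data (x_i, a_i, r_i) with a hidden confounder u_i.

    Assumed setup: Gaussian contexts, a binary confounder, a softmax logging
    policy mu(a | x, u), and Bernoulli rewards depending on (x, u, a).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, dim))                      # observed contexts
    u = rng.integers(0, 2, size=n)                     # unobserved confounder
    W = rng.normal(size=(dim, n_actions))
    scores = x @ W + 0.5 * u[:, None]                  # confounder shifts the logging policy
    mu = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    a = np.array([rng.choice(n_actions, p=p) for p in mu])
    p_reward = 1.0 / (1.0 + np.exp(-(x[:, 0] + u - 1.0 + 0.3 * a)))
    r = rng.binomial(1, p_reward)                      # reward only for the taken action
    logged_mu = mu[np.arange(n), a]                    # logged propensity (may be discarded in practice)
    return x, a, r, logged_mu
```

The returned `logged_mu` corresponds to the propensity scores whose presence or absence determines which of the estimators below are applicable.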
2. Inverse Probability Weighted Estimation and Its Limitations
A canonical approach to off-policy evaluation and imitation in offline contextual bandits is inverse probability weighted estimation (IPWE):
$$\hat{V}_{\mathrm{IPW}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n} w_i\, r_i, \qquad w_i = \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)},$$
where $w_i$ is the importance weight. This estimator is unbiased under known propensities but is vulnerable to:
- Unavailable or missing logging probabilities $\mu(a_i \mid x_i)$ (engineering or data-collection limitations),
- Small logging probabilities: if $\mu(a_i \mid x_i)$ is close to zero, $w_i$ becomes large, leading to extremely high variance in $\hat{V}_{\mathrm{IPW}}$, with single rare events potentially dominating the sum and rendering confidence intervals and significance tests unreliable (a minimal estimator sketch follows this list).
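As a point of reference, here is a minimal sketch of the IPWE estimator and its empirical standard error, assuming per-sample arrays of target-policy probabilities `pi_probs`, logged propensities `mu_probs`, and rewards (the names are illustrative):

```python
import numpy as np

def ipwe(pi_probs, mu_probs, rewards):
    """Inverse probability weighted estimate of a target policy's value.

    pi_probs : pi(a_i | x_i), target-policy probability of each logged action
    mu_probs : mu(a_i | x_i), logged propensity of each taken action
    rewards  : observed rewards r_i
    """
    w = pi_probs / mu_probs                      # importance weights
    terms = w * rewards
    estimate = terms.mean()
    # Standard error of the mean; tiny propensities inflate this dramatically.
    std_err = terms.std(ddof=1) / np.sqrt(len(terms))
    return estimate, std_err
```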
3. Policy Improvement Objectives and Policy Imitation Regularization
To address IPWE's variance and applicability limitations, a policy improvement objective (PIL) is proposed. PIL is a lower-bound surrogate of IPWE, derived from the elementary inequality $w \geq 1 + \log w$ for $w > 0$. With available logging probabilities, a useful lower bound (for non-negative rewards) is:
$$\hat{V}_{\mathrm{PIL}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n} r_i\left(1 + \log\frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\right) \;\leq\; \hat{V}_{\mathrm{IPW}}(\pi),$$
whereas with missing $\mu(a_i \mid x_i)$, the formulation reverts to reward-weighted cross-entropy (up to terms that do not depend on $\pi$):
$$\hat{V}_{\mathrm{CE}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n} r_i \log \pi(a_i \mid x_i).$$
Policy Imitation Learning (IML) regularizes this objective:
$$\hat{L}_{\mathrm{IML}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n} \log\frac{\mu(a_i \mid x_i)}{\pi(a_i \mid x_i)},$$
which is an empirical estimate of the KL-divergence $\mathrm{KL}\big(\mu(\cdot \mid x)\,\|\,\pi(\cdot \mid x)\big)$ averaged over contexts; minimizing IML encourages $\pi$ to mimic the logging policy, thereby reducing the variance of the importance weights. Explicitly, a second-order Taylor expansion around $w_i = 1$ gives
$$-\log w_i \;\approx\; (1 - w_i) + \tfrac{1}{2}(1 - w_i)^2,$$
and since the importance weights have mean one under the logging distribution, the first-order term vanishes in expectation, making the average IML loss a direct proxy for (half) the variance of the IPWE weights.
The unified learning objective is:
$$\max_{\pi}\;\;\hat{V}_{\mathrm{PIL}}(\pi) \;-\; \lambda\,\hat{L}_{\mathrm{IML}}(\pi),$$
where $\lambda$ governs the exploitation–exploration (variance–bias) balance. When logging probabilities are unavailable, reward-weighted cross-entropy (as in standard supervised learning) is shown to be a justifiable surrogate.
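The sketch below evaluates the PIL objective, the IML regularizer, and the combined objective for fixed per-sample probabilities; it is a hedged illustration of the formulas above, not the paper's reference implementation, and in practice one would parameterize $\pi$ (e.g., a softmax over context features) and maximize the objective by gradient ascent.

```python
import numpy as np

def pil_iml_objective(pi_probs, mu_probs, rewards, lam=1.0, eps=1e-12):
    """PIL-IML objective to be maximized (higher is better).

    PIL : (1/n) * sum_i r_i * (1 + log(pi_i / mu_i))   -- lower bound on IPWE
    IML : (1/n) * sum_i log(mu_i / pi_i)               -- empirical KL(mu || pi) proxy
    """
    log_w = np.log(pi_probs + eps) - np.log(mu_probs + eps)
    pil = np.mean(rewards * (1.0 + log_w))
    iml = np.mean(-log_w)
    return pil - lam * iml

def reward_weighted_ce(pi_probs, rewards, eps=1e-12):
    """Surrogate objective when logging propensities are unavailable."""
    return np.mean(rewards * np.log(pi_probs + eps))
```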
4. Probability Logging, Diagnosability, and Confounding
Probability logging, the practice of storing the propensity $\mu(a_i \mid x_i)$ alongside each logged interaction, serves two crucial purposes:
- Bias Correction: With complete action propensities, unbiased IPWE or PIL can be performed.
- Model Diagnosability: High IML loss (or high perplexity) is a diagnostic for either confounding (missing variables $u$) or policy-class misspecification, since the logging policy cannot be well explained (or imitated) using the available model class. Thus, IML underfitting flags hidden influences and motivates model refinement or cautious policy deployment.
The framework is thus adaptive: with missing probabilities, it defaults to robustly regularized cross-entropy; with full propensities, it can both debias and analyze model misspecification.
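A hedged sketch of this diagnostic: after fitting an imitation policy to the logged actions, report the IML loss or its exponentiated form (perplexity); values far above what the known randomization scheme would imply suggest confounding or an inadequate policy class. The `iml_perplexity` helper and its fallback behavior are illustrative assumptions.

```python
import numpy as np

def iml_perplexity(pi_probs, mu_probs=None, eps=1e-12):
    """Perplexity of the imitation policy on the logged actions.

    With propensities: exp of the empirical KL proxy mean(log(mu/pi)), which is
    close to 1 when pi imitates mu well. Without propensities: exp of the
    negative log-likelihood, compared against the known randomization level.
    """
    if mu_probs is not None:
        loss = np.mean(np.log(mu_probs + eps) - np.log(pi_probs + eps))
    else:
        loss = -np.mean(np.log(pi_probs + eps))
    return float(np.exp(loss))
```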
5. Simulation Results and Empirical Insights
Simulation studies validate the framework on both classical and real-world datasets:
- Simpson’s Paradox (Kidney Stone Data): By modeling both observed (size) and hidden confounders during randomized assignment, the PIL-IML approach correctly re-weights to recommend the effective treatment, overcoming the paradox that plagues unadjusted analyses.
- UCI Multiclass-to-Bandit Conversions: Benchmarking against Q-learning, vanilla IPWE, and doubly robust estimators, the PIL-IML approach demonstrates lower variance and superior performance, particularly under model misspecification. The reward-weighted cross-entropy surrogate proves robust when action propensities are missing.
- Criteo Counterfactual Data: With extremely heavy-tailed importance weights (up to 49,000), vanilla IPWE is unusable without variance control. PIL-IML, combined with weight clipping and bootstrapping, produces usable estimates and reveals via the IML diagnostic that essential confounders are not available in the dataset, a practical signal of real-world data quality limitations (a clipping-plus-bootstrap sketch follows this list).
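A minimal sketch of the variance-control recipe described above: clip the importance weights and bootstrap the clipped estimator to obtain usable confidence intervals. The clip level and bootstrap count are illustrative choices, not values from the paper.

```python
import numpy as np

def clipped_ipwe_bootstrap(pi_probs, mu_probs, rewards, clip=100.0,
                           n_boot=1000, seed=0):
    """Weight-clipped IPWE with a percentile-bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    w = np.minimum(pi_probs / mu_probs, clip)    # cap heavy-tailed weights
    terms = w * rewards
    point = terms.mean()
    boot_means = np.array([
        rng.choice(terms, size=len(terms), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return point, (lo, hi)
```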
Moreover, the method supports IML-resampling: using the learned imitation policy to resample the logged data, thereby increasing effective exploration and improving subsequent learning (sketched below).
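One plausible instantiation of IML-resampling, under the assumption that flattening the logged action distribution is the goal; this is a hedged sketch, not necessarily the paper's exact procedure.

```python
import numpy as np

def iml_resample(x, a, r, pi_imitation_probs, seed=0):
    """Resample logged tuples with weights inversely proportional to the
    imitation policy's probability of each logged action, so that
    over-represented actions are down-weighted and rare actions gain coverage.

    NOTE: illustrative reading of IML-resampling, not a reference implementation.
    """
    rng = np.random.default_rng(seed)
    weights = 1.0 / np.clip(pi_imitation_probs, 1e-6, None)
    p = weights / weights.sum()
    idx = rng.choice(len(a), size=len(a), replace=True, p=p)
    return x[idx], a[idx], r[idx]
```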
6. Practical Implications and Deployment Considerations
- Variance Control: Explicit regularization by KL-divergence (IML) is effective at controlling the variance of offline policy improvement, which is especially critical in high-dimensional action spaces or sparse data regimes.
- Policy Class Selection: The IML diagnostic (high perplexity) provides a principled way to assess whether the chosen policy parameterization is adequate or confounded, guiding both model selection and safe policy deployment.
- Handling Missing Data: The explicit connection between cross-entropy loss and the IPWE surrogate allows practitioners to deploy offline imitation learning without strict logging requirements, making the approach robust to operational and engineering challenges.
- Future Optimization: Weight clipping, reward-weighted losses, and IML-resampling are recommended when deploying in environments with heavy-tailed propensities or suspected confounding.
7. Summary Table: Key Techniques and Formulae
| Technique / Concept | Mathematical Expression | Role / Purpose |
|---|---|---|
| IPWE | $\hat{V}_{\mathrm{IPW}}(\pi) = \frac{1}{n}\sum_i w_i r_i$, $w_i = \pi(a_i \mid x_i)/\mu(a_i \mid x_i)$ | Unbiased policy value estimation |
| Weight Clipping, PIL | $\min(w_i, c)$; see the PIL expressions above | Variance reduction, lower-bounding |
| Reward-weighted Cross-Entropy (CE) | $\frac{1}{n}\sum_i r_i \log \pi(a_i \mid x_i)$ | Surrogate for missing propensities |
| IML Regularization | $\hat{L}_{\mathrm{IML}}(\pi) = \frac{1}{n}\sum_i \log\frac{\mu(a_i \mid x_i)}{\pi(a_i \mid x_i)}$, approximates weight variance | Variance control, misspecification detection |
| Total Objective | $\hat{V}_{\mathrm{PIL}}(\pi) - \lambda\,\hat{L}_{\mathrm{IML}}(\pi)$ | Combines policy improvement & variance control |
| Policy Diagnosability via IML | High IML loss $\Rightarrow$ confounding or misspecification | Data/model quality assessment |
| Greedy Policy Update | Gradient of PIL-IML objective | Local equivalence to natural gradient |
This framework—anchored in convex surrogates for high-variance estimators and regularization by policy imitation—provides both sound theoretical guarantees and empirical evidence for its effectiveness in offline imitation learning for contextual bandits. The diagnostic properties of IML loss, adaptability to incomplete logs, and robust empirical performance across datasets establish it as a foundational approach for real-world, reliable offline policy learning.