
Offline Imitation Learning in Contextual Bandits

Updated 20 October 2025
  • The paper introduces a PIL-IML framework that leverages a surrogate objective to approximate the reward-maximizing policy while controlling the high variance of importance weights.
  • It employs reward-weighted cross-entropy when logging probabilities are missing, ensuring robust policy learning despite incomplete propensity scores and confounding variables.
  • Empirical evaluations on simulated and real-world datasets demonstrate improved performance over traditional IPWE methods, supporting safe deployment and effective diagnostic assessment.

Offline imitation learning in contextual bandits is the problem of inferring a reward-maximizing policy from static, logged data generated by a historical decision-making system operating under the contextual bandit paradigm. The central challenge is to use data comprising context–action–reward tuples (often with randomized action selection and possibly missing logging probabilities) to construct policies that safely imitate, or improve upon, the historical behavior while avoiding the pitfalls of confounding, distribution shift, and high-variance estimation.

1. Problem Formulation and Contextual Bandit Model

The offline contextual bandit model considered in (Ma et al., 2019) collects data consisting of tuples (x, a, r), where x denotes the context, a the action chosen by a logging policy μ, and r the observed (non-negative) reward. Notably, action selection is often randomized by μ, and rewards exist only for the actions actually taken. Furthermore, unobserved confounders h that influence both a and r may be present, resulting in the data-generating process:

  • (x, h) ∼ P(x, h),
  • a ∼ μ(a | x, h) (the action set may be context- and confounder-dependent),
  • r ∼ P(r | x, h, a).

The learning objective is to construct a new policy π(a|x) that, if deployed, will maximize expected reward:

\mathbb{E}_{\pi}[r] = \mathbb{E}_x\left[ \sum_{a \in A(x)} \pi(a|x)\, r(x, a) \right],

despite observing feedback only for the actions taken by the logging policy rather than for all possible actions under each context.
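
To make the setup concrete, here is a minimal simulation of such logged data under purely illustrative assumptions (Gaussian contexts, a binary hidden confounder h, a linear softmax logging policy, and a linear-plus-noise reward); none of these modeling choices are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_bandit_data(n=10_000, n_actions=3, d=5):
    """Simulate logged tuples (x, a, r, mu(a|x,h)) with a hidden confounder h."""
    X = rng.normal(size=(n, d))                      # contexts x ~ P(x)
    h = rng.binomial(1, 0.5, size=n)                 # hidden confounder h
    # Logging policy mu(a | x, h): softmax over random linear scores, shifted by h.
    W = rng.normal(size=(d, n_actions))
    scores = X @ W + h[:, None]                      # confounder biases the action choice
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    a = np.array([rng.choice(n_actions, p=p) for p in probs])
    # Reward depends on x, h, and the chosen action only (bandit feedback).
    theta = rng.normal(size=(d, n_actions))
    mean_r = (X @ theta)[np.arange(n), a] + 0.5 * h
    r = np.clip(mean_r + rng.normal(scale=0.1, size=n), 0, None)   # non-negative reward
    mu = probs[np.arange(n), a]                      # logged propensity of the taken action
    return X, a, r, mu

X, a, r, mu = log_bandit_data()
```

Note that h influences both the logged action and the reward but is not returned to the learner, which is exactly the confounding scenario described above.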

2. Inverse Probability Weighted Estimation and Its Limitations

A canonical approach for off-policy evaluation and imitation in offline contextual bandits is inverse probability weighted estimation (IPWE):

\Delta \mathrm{IPWE}(\pi) = \frac{1}{n} \sum_{i=1}^n \left( \frac{\pi(a_i|x_i)}{\mu(a_i|x_i)} - 1 \right) r_i,

where w_i = π(a_i|x_i)/μ(a_i|x_i) is the importance weight. This estimator is unbiased under known propensities but is vulnerable to:

  • Unavailable or missing logging probabilities (due to engineering or data-collection limitations);
  • Small logging probabilities: if μ(a|x) is close to zero, w becomes large, leading to extremely high variance in the IPWE estimate; single rare events can then dominate the sum, rendering confidence intervals and significance tests unreliable (see the sketch below).
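
For reference, a minimal sketch of the IPWE estimate for a candidate policy, assuming the logged propensities mu are available and a linear softmax policy class (both illustrative); it can be run on the simulated data above.

```python
import numpy as np

def softmax_policy(params, X):
    """pi(a|x) for a linear softmax policy; params has shape (d, n_actions)."""
    scores = X @ params
    scores -= scores.max(axis=1, keepdims=True)
    p = np.exp(scores)
    return p / p.sum(axis=1, keepdims=True)

def ipwe(params, X, a, r, mu):
    """Delta IPWE(pi) = mean((pi(a_i|x_i)/mu(a_i|x_i) - 1) * r_i)."""
    pi = softmax_policy(params, X)[np.arange(len(a)), a]
    w = pi / mu                       # importance weights; explode when mu is tiny
    return np.mean((w - 1.0) * r)
```

When some mu values are tiny, the corresponding weights dominate the average, which is exactly the variance problem the surrogate in the next section is designed to control.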

3. Policy Improvement Objectives and Policy Imitation Regularization

To address IPWE's variance and applicability limitations, a policy improvement objective (PIL) is proposed. PIL is a lower-bound surrogate to IPWE, derived from the inequality \log w \leq w - 1. With available probabilities, a useful lower bound is:

\mathrm{PIL}_\mu(\pi) = \frac{1}{n} \sum_{i=1}^n r_i \left[ \log w_i \cdot 1_{\{w_i \geq 1\}} + (w_i - 1) \cdot 1_{\{w_i < 1\}} \right],

whereas with missing μ, the formulation reverts to reward-weighted cross-entropy (the log μ term does not depend on π and can be dropped from the optimization):

\mathrm{PIL}_\emptyset(\pi) = \frac{1}{n} \sum_{i=1}^n r_i \log w_i.
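
As a sketch, the two surrogates translate directly into code; this reuses the hypothetical softmax_policy helper from the IPWE sketch above.

```python
import numpy as np

def pil_mu(params, X, a, r, mu):
    """PIL_mu: lower-bound surrogate usable when logging probabilities are available."""
    pi = softmax_policy(params, X)[np.arange(len(a)), a]
    w = pi / mu
    per_example = np.where(w >= 1.0, np.log(w), w - 1.0)   # log w if w >= 1, else w - 1
    return np.mean(r * per_example)

def pil_missing(params, X, a, r):
    """PIL_0: reward-weighted cross-entropy when mu was never logged."""
    pi = softmax_policy(params, X)[np.arange(len(a)), a]
    # log w = log pi - log mu, and the log mu term is constant in the policy
    # parameters, so maximizing the mean of r * log pi is equivalent up to a constant.
    return np.mean(r * np.log(pi))
```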

Policy Imitation Learning (IML) regularizes this objective:

\mathrm{IML}_{\text{part}}(\pi) = -\frac{1}{n}\sum_{i=1}^n \log\left(\frac{\pi(a_i|x_i)}{\mu(a_i|x_i)}\right),

which is an empirical estimate of the KL-divergence KL(μ‖π) under the logging distribution; minimizing IML encourages π to mimic the logging policy, thereby reducing the variance of the importance weights. Explicitly, a second-order Taylor expansion reveals:

-\log w \approx \frac{1}{2} (w - 1)^2,

making the average IML loss a direct proxy for the variance of IPWE.
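
The step from the IML loss to a variance proxy can be spelled out with a standard second-order expansion around w = 1; the short derivation below is a routine calculation, not quoted from the paper.

```latex
% Second-order Taylor expansion of -\log w at w = 1:
\[
  -\log w = -(w - 1) + \tfrac{1}{2}(w - 1)^2 + O\!\big((w - 1)^3\big).
\]
% Under the logging policy, E_mu[w] = sum_a mu(a|x) * pi(a|x)/mu(a|x) = 1,
% so the linear term averages out over the logged data and
\[
  \mathrm{IML}_{\text{part}}(\pi) \;\approx\; \frac{1}{2n}\sum_{i=1}^n (w_i - 1)^2,
\]
% i.e. roughly half the empirical variance of the importance weights.
```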

The unified learning objective is:

\max_\pi\ \mathrm{PIL}(r, \pi) - \epsilon\, \mathrm{IML}(\pi),

where ε governs the exploitation–exploration (variance–bias) balance. When logging probabilities are unavailable, reward-weighted cross-entropy (as in standard supervised learning) is shown to be a justifiable surrogate.
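
A minimal end-to-end sketch of optimizing this objective; the linear softmax policy, the value of ε, and the use of SciPy's L-BFGS-B with numerical gradients are illustrative assumptions, and softmax_policy plus the data arrays come from the earlier sketches.

```python
import numpy as np
from scipy.optimize import minimize

def neg_objective(flat_params, X, a, r, mu, eps, n_actions):
    """Negative of PIL_mu(pi) - eps * IML(pi), so a minimizer maximizes the objective."""
    params = flat_params.reshape(X.shape[1], n_actions)
    pi = softmax_policy(params, X)[np.arange(len(a)), a]
    w = pi / mu
    pil = np.mean(r * np.where(w >= 1.0, np.log(w), w - 1.0))   # PIL_mu surrogate
    iml = -np.mean(np.log(w))                                   # empirical KL(mu || pi)
    return -(pil - eps * iml)

n_actions, eps = 3, 0.1   # eps sets the improvement-vs-imitation trade-off
init = 0.01 * np.random.default_rng(1).normal(size=X.shape[1] * n_actions)
res = minimize(neg_objective, init, args=(X, a, r, mu, eps, n_actions), method="L-BFGS-B")
learned_params = res.x.reshape(X.shape[1], n_actions)
```

Larger values of ε keep the learned policy closer to the logging policy (lower variance, more bias), while smaller values emphasize reward improvement.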

4. Probability Logging, Diagnosability, and Confounding

Probability logging, the practice of storing μ(a|x), serves two crucial purposes:

  • Bias Correction: With complete action propensities, unbiased IPWE or its PIL lower bound can be computed.
  • Model Diagnosability: High IML loss (or high perplexity) is a diagnostic for either confounding (missing variables h) or policy-class misspecification, since the logging policy cannot be well explained (or imitated) using the available model class. Thus, IML underfitting flags hidden influences and motivates model refinement or careful policy deployment.

The framework is thus adaptive: with missing probabilities, it defaults to robustly regularized cross-entropy; with full propensities, it can both debias and analyze model misspecification.
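
One concrete way to operationalize this diagnostic is to fit the policy class to imitate the logged actions and inspect the resulting loss and perplexity; the quantities reported below (again reusing the hypothetical softmax_policy helper) are an illustrative choice, not the paper's protocol.

```python
import numpy as np

def imitation_diagnostic(params, X, a, mu=None):
    """IML loss and perplexity for a policy fitted to imitate the logged actions."""
    pi = softmax_policy(params, X)[np.arange(len(a)), a]
    cross_entropy = -np.mean(np.log(pi))            # average negative log-likelihood
    perplexity = np.exp(cross_entropy)              # effective branching factor
    # Empirical KL(mu || pi) when propensities were logged, else None.
    kl = None if mu is None else cross_entropy + np.mean(np.log(mu))
    return cross_entropy, perplexity, kl
```

High values relative to a well-specified baseline suggest hidden confounders or an inadequate policy class.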

5. Simulation Results and Empirical Insights

Simulation studies validate the framework on both classical and real-world datasets:

  • Simpson’s Paradox (Kidney Stone Data): By modeling both observed (size) and hidden confounders during randomized assignment, the PIL-IML approach correctly re-weights to recommend the effective treatment, overcoming the paradox that plagues unadjusted analyses.
  • UCI Multiclass-to-Bandit Conversions: Benchmarking against Q-learning, vanilla IPWE, and doubly robust estimators, the PIL-IML approach demonstrates lower variance and superior performance, particularly under model misspecification. The reward-weighted cross-entropy surrogate proves robust when action propensities are missing.
  • Criteo Counterfactual Data: Facing extreme heavy-tail importance weights (up to 49,000), IPWE is unusable without variance control. PIL-IML, along with weight clipping and bootstrapping, produces usable estimates, and reveals via IML that essential confounders are not available in the dataset—a practical diagnostic for real-world data quality limitations.

Moreover, the method supports IML-resampling: Using the learned imitation policy to resample the logged data, thereby increasing exploration and improving subsequent learning.
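
A minimal sketch of one way such resampling could be realized, assuming examples are re-drawn with probability inversely proportional to the fitted imitation policy's propensity for the logged action, so that over-represented actions are down-weighted; this particular weighting scheme is an assumption for illustration.

```python
import numpy as np

def iml_resample(X, a, r, imitation_pi, rng=None):
    """Resample logged data with weights 1 / pi_hat(a_i|x_i) to flatten action coverage."""
    rng = rng or np.random.default_rng(0)
    weights = 1.0 / imitation_pi          # imitation_pi[i] = pi_hat(a_i | x_i)
    weights /= weights.sum()
    idx = rng.choice(len(a), size=len(a), replace=True, p=weights)
    return X[idx], a[idx], r[idx]
```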

6. Practical Implications and Deployment Considerations

  • Variance Control: Explicit regularization by KL-divergence (IML) is effective at controlling the variance of offline policy improvement, which is especially critical in high-dimensional action spaces or sparse data regimes.
  • Policy Class Selection: The IML diagnostic (high perplexity) provides a principled way to assess whether the chosen policy parameterization is adequate or confounded, guiding both model selection and safe policy deployment.
  • Handling Missing Data: The explicit connection between cross-entropy loss and the IPWE surrogate allows practitioners to deploy offline imitation learning without strict logging requirements, making the approach robust to operational and engineering challenges.
  • Future Optimization: Weight clipping, reward-weighted losses, and IML-resampling are recommended when deploying in environments with heavy-tailed propensities or suspected confounding (a minimal clipping sketch follows).
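
The weight clipping mentioned above amounts to a one-line change to the IPWE-style estimate; the threshold value here is a hypothetical tuning parameter.

```python
import numpy as np

def clipped_ipwe(pi, mu, r, clip=20.0):
    """IPWE with importance weights truncated at `clip` to tame heavy tails (adds bias)."""
    w = np.minimum(pi / mu, clip)
    return np.mean((w - 1.0) * r)
```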

7. Summary Table: Key Techniques and Formulae

| Technique / Concept | Mathematical Expression | Role / Purpose |
| --- | --- | --- |
| IPWE | \frac{1}{n} \sum_i (w_i - 1) r_i | Unbiased policy value estimation |
| Weight clipping, PIL | see PIL expressions above | Variance reduction, lower-bounding |
| Reward-weighted cross-entropy (CE) | -\frac{1}{n} \sum_i r_i \log \pi(a_i|x_i) | Surrogate for missing propensities |
| IML regularization | -\frac{1}{n} \sum_i \log(\pi(a_i|x_i)/\mu(a_i|x_i)), approximates weight variance | Variance control, misspecification diagnosis |
| Total objective | \max_\pi\ \mathrm{PIL}(r,\pi) - \epsilon\, \mathrm{IML}(\pi) | Combines policy improvement and variance control |
| Policy diagnosability via IML | High IML loss ⇔ confounding or misspecification | Data/model quality assessment |
| Greedy policy update | Gradient of the PIL-IML objective | Locally equivalent to natural gradient |

This framework—anchored in convex surrogates for high-variance estimators and regularization by policy imitation—provides both sound theoretical guarantees and empirical evidence for its effectiveness in offline imitation learning for contextual bandits. The diagnostic properties of IML loss, adaptability to incomplete logs, and robust empirical performance across datasets establish it as a foundational approach for real-world, reliable offline policy learning.
