
Offline Imitation Learning in Contextual Bandits

Updated 20 October 2025
  • The paper introduces a PIL-IML framework that leverages a surrogate objective to approximate the reward-maximizing policy while controlling the high variance of importance weights.
  • It employs reward-weighted cross-entropy when logging probabilities are missing, ensuring robust policy learning despite incomplete propensity scores and confounding variables.
  • Empirical evaluations on simulated and real-world datasets demonstrate improved performance over traditional IPWE methods, supporting safe deployment and effective diagnostic assessment.

Offline imitation learning in contextual bandits is the problem of inferring a reward-maximizing policy from static, logged data generated by a historical decision-making system operating under the contextual bandit paradigm. The central challenge is to use data comprising context–action–reward tuples (often with randomized action selection and possibly missing logging probabilities) to construct policies that safely imitate, or improve upon, the historical behavior while avoiding the pitfalls of confounding, distribution shift, and high-variance estimation.

1. Problem Formulation and Contextual Bandit Model

The offline contextual bandit model considered in (Ma et al., 2019) collects data consisting of tuples (x, a, r), where x denotes the context, a the action chosen by a logging policy μ, and r the observed (non-negative) reward. Notably, action selection is often randomized by μ, and rewards exist only for the actions actually taken. Furthermore, unobserved confounders h that influence both a and r may be present, resulting in the data-generating process:

  • (x, h) ∼ P(x, h),
  • a ∼ μ(a | x, h) (the action set may be context- and confounder-dependent),
  • r ∼ P(r | x, h, a).

The learning objective is to construct a new policy π(a|x) that, if deployed, will maximize expected reward:

\mathbb{E}_{\pi}[r] = \mathbb{E}_x\left[ \sum_{a \in A(x)} \pi(a|x)\, r(x, a) \right],

despite observing feedback only for the actions taken by the logging policy rather than for all possible actions under each context.
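
To make the setup concrete, here is a minimal simulation of such logged data under purely illustrative assumptions (Gaussian contexts, a binary hidden confounder h, a linear softmax logging policy, and a linear-plus-noise reward); none of these modeling choices are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_bandit_data(n=10_000, n_actions=3, d=5):
    """Simulate logged tuples (x, a, r, mu(a|x,h)) with a hidden confounder h."""
    X = rng.normal(size=(n, d))                      # contexts x ~ P(x)
    h = rng.binomial(1, 0.5, size=n)                 # hidden confounder h
    # Logging policy mu(a | x, h): softmax over random linear scores, shifted by h.
    W = rng.normal(size=(d, n_actions))
    scores = X @ W + h[:, None]                      # confounder biases the action choice
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    a = np.array([rng.choice(n_actions, p=p) for p in probs])
    # Reward depends on x, h, and the chosen action only (bandit feedback).
    theta = rng.normal(size=(d, n_actions))
    mean_r = (X @ theta)[np.arange(n), a] + 0.5 * h
    r = np.clip(mean_r + rng.normal(scale=0.1, size=n), 0, None)   # non-negative reward
    mu = probs[np.arange(n), a]                      # logged propensity of the taken action
    return X, a, r, mu

X, a, r, mu = log_bandit_data()
```

Note that h influences both the logged action and the reward but is not returned to the learner, which is exactly the confounding scenario described above.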

2. Inverse Probability Weighted Estimation and Its Limitations

A canonical approach for off-policy evaluation and imitation in offline contextual bandits is inverse probability weighted estimation (IPWE):

\Delta \mathrm{IPWE}(\pi) = \frac{1}{n} \sum_{i=1}^n \left( \frac{\pi(a_i|x_i)}{\mu(a_i|x_i)} - 1 \right) r_i,

where w_i = π(a_i|x_i)/μ(a_i|x_i) is the importance weight. This estimator is unbiased under known propensities but is vulnerable to:

  • Unavailable or missing logging probabilities (due to engineering or data-collection limitations);
  • Small logging probabilities: if μ(a|x) is close to zero, w becomes large, leading to extremely high variance in the IPWE estimate; single rare events can then dominate the sum, rendering confidence intervals and significance tests unreliable (see the sketch below).
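
For reference, a minimal sketch of the IPWE estimate for a candidate policy, assuming the logged propensities mu are available and a linear softmax policy class (both illustrative); it can be run on the simulated data above.

```python
import numpy as np

def softmax_policy(params, X):
    """pi(a|x) for a linear softmax policy; params has shape (d, n_actions)."""
    scores = X @ params
    scores -= scores.max(axis=1, keepdims=True)
    p = np.exp(scores)
    return p / p.sum(axis=1, keepdims=True)

def ipwe(params, X, a, r, mu):
    """Delta IPWE(pi) = mean((pi(a_i|x_i)/mu(a_i|x_i) - 1) * r_i)."""
    pi = softmax_policy(params, X)[np.arange(len(a)), a]
    w = pi / mu                       # importance weights; explode when mu is tiny
    return np.mean((w - 1.0) * r)
```

When some mu values are tiny, the corresponding weights dominate the average, which is exactly the variance problem the surrogate in the next section is designed to control.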

3. Policy Improvement Objectives and Policy Imitation Regularization

To address IPWE's variance and applicability limitations, a policy improvement objective (PIL) is proposed. PIL is a lower-bound surrogate to IPWE, derived from the inequality \log w \leq w - 1. With available probabilities, a useful lower bound is:

\mathrm{PIL}_\mu(\pi) = \frac{1}{n} \sum_{i=1}^n r_i \left[ \log w_i \cdot 1_{\{w_i \geq 1\}} + (w_i - 1) \cdot 1_{\{w_i < 1\}} \right],

whereas with missing μ, the formulation reverts to reward-weighted cross-entropy (the log μ term does not depend on π and can be dropped from the optimization):

\mathrm{PIL}_\emptyset(\pi) = \frac{1}{n} \sum_{i=1}^n r_i \log w_i.
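
As a sketch, the two surrogates translate directly into code; this reuses the hypothetical softmax_policy helper from the IPWE sketch above.

```python
import numpy as np

def pil_mu(params, X, a, r, mu):
    """PIL_mu: lower-bound surrogate usable when logging probabilities are available."""
    pi = softmax_policy(params, X)[np.arange(len(a)), a]
    w = pi / mu
    per_example = np.where(w >= 1.0, np.log(w), w - 1.0)   # log w if w >= 1, else w - 1
    return np.mean(r * per_example)

def pil_missing(params, X, a, r):
    """PIL_0: reward-weighted cross-entropy when mu was never logged."""
    pi = softmax_policy(params, X)[np.arange(len(a)), a]
    # log w = log pi - log mu, and the log mu term is constant in the policy
    # parameters, so maximizing the mean of r * log pi is equivalent up to a constant.
    return np.mean(r * np.log(pi))
```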

Policy Imitation Learning (IML) regularizes this objective:

\mathrm{IML}_{\text{part}}(\pi) = -\frac{1}{n}\sum_{i=1}^n \log\left(\frac{\pi(a_i|x_i)}{\mu(a_i|x_i)}\right),

which is an empirical estimate of the KL-divergence KL(μ‖π) under the logging distribution; minimizing IML encourages π to mimic the logging policy, thereby reducing the variance of the importance weights. Explicitly, a second-order Taylor expansion reveals:

-\log w \approx \frac{1}{2} (w - 1)^2,

making the average IML loss a direct proxy for the variance of IPWE.
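
The step from the IML loss to a variance proxy can be spelled out with a standard second-order expansion around w = 1; the short derivation below is a routine calculation, not quoted from the paper.

```latex
% Second-order Taylor expansion of -\log w at w = 1:
\[
  -\log w = -(w - 1) + \tfrac{1}{2}(w - 1)^2 + O\!\big((w - 1)^3\big).
\]
% Under the logging policy, E_mu[w] = sum_a mu(a|x) * pi(a|x)/mu(a|x) = 1,
% so the linear term averages out over the logged data and
\[
  \mathrm{IML}_{\text{part}}(\pi) \;\approx\; \frac{1}{2n}\sum_{i=1}^n (w_i - 1)^2,
\]
% i.e. roughly half the empirical variance of the importance weights.
```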

The unified learning objective is:

\max_\pi\ \mathrm{PIL}(r, \pi) - \epsilon\, \mathrm{IML}(\pi),

where ε governs the exploitation–exploration (variance–bias) balance. When logging probabilities are unavailable, reward-weighted cross-entropy (as in standard supervised learning) is shown to be a justifiable surrogate.
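
A minimal end-to-end sketch of optimizing this objective; the linear softmax policy, the value of ε, and the use of SciPy's L-BFGS-B with numerical gradients are illustrative assumptions, and softmax_policy plus the data arrays come from the earlier sketches.

```python
import numpy as np
from scipy.optimize import minimize

def neg_objective(flat_params, X, a, r, mu, eps, n_actions):
    """Negative of PIL_mu(pi) - eps * IML(pi), so a minimizer maximizes the objective."""
    params = flat_params.reshape(X.shape[1], n_actions)
    pi = softmax_policy(params, X)[np.arange(len(a)), a]
    w = pi / mu
    pil = np.mean(r * np.where(w >= 1.0, np.log(w), w - 1.0))   # PIL_mu surrogate
    iml = -np.mean(np.log(w))                                   # empirical KL(mu || pi)
    return -(pil - eps * iml)

n_actions, eps = 3, 0.1   # eps sets the improvement-vs-imitation trade-off
init = 0.01 * np.random.default_rng(1).normal(size=X.shape[1] * n_actions)
res = minimize(neg_objective, init, args=(X, a, r, mu, eps, n_actions), method="L-BFGS-B")
learned_params = res.x.reshape(X.shape[1], n_actions)
```

Larger values of ε keep the learned policy closer to the logging policy (lower variance, more bias), while smaller values emphasize reward improvement.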

4. Probability Logging, Diagnosability, and Confounding

Probability logging, the practice of storing μ(a|x), serves two crucial purposes:

  • Bias Correction: With complete action propensities, unbiased IPWE or its PIL lower bound can be computed.
  • Model Diagnosability: High IML loss (or high perplexity) is a diagnostic for either confounding (missing variables h) or policy-class misspecification, since the logging policy cannot be well explained (or imitated) using the available model class. Thus, IML underfitting flags hidden influences and motivates model refinement or careful policy deployment.

The framework is thus adaptive: with missing probabilities, it defaults to robustly regularized cross-entropy; with full propensities, it can both debias and analyze model misspecification.
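
One concrete way to operationalize this diagnostic is to fit the policy class to imitate the logged actions and inspect the resulting loss and perplexity; the quantities reported below (again reusing the hypothetical softmax_policy helper) are an illustrative choice, not the paper's protocol.

```python
import numpy as np

def imitation_diagnostic(params, X, a, mu=None):
    """IML loss and perplexity for a policy fitted to imitate the logged actions."""
    pi = softmax_policy(params, X)[np.arange(len(a)), a]
    cross_entropy = -np.mean(np.log(pi))            # average negative log-likelihood
    perplexity = np.exp(cross_entropy)              # effective branching factor
    # Empirical KL(mu || pi) when propensities were logged, else None.
    kl = None if mu is None else cross_entropy + np.mean(np.log(mu))
    return cross_entropy, perplexity, kl
```

High values relative to a well-specified baseline suggest hidden confounders or an inadequate policy class.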

5. Simulation Results and Empirical Insights

Simulation studies validate the framework on both classical and real-world datasets:

  • Simpson’s Paradox (Kidney Stone Data): By modeling both observed (size) and hidden confounders during randomized assignment, the PIL-IML approach correctly re-weights to recommend the effective treatment, overcoming the paradox that plagues unadjusted analyses.
  • UCI Multiclass-to-Bandit Conversions: Benchmarking against Q-learning, vanilla IPWE, and doubly robust estimators, the PIL-IML approach demonstrates lower variance and superior performance, particularly under model misspecification. The reward-weighted cross-entropy surrogate proves robust when action propensities are missing.
  • Criteo Counterfactual Data: Facing extreme heavy-tail importance weights (up to 49,000), IPWE is unusable without variance control. PIL-IML, along with weight clipping and bootstrapping, produces usable estimates, and reveals via IML that essential confounders are not available in the dataset—a practical diagnostic for real-world data quality limitations.

Moreover, the method supports IML-resampling: Using the learned imitation policy to resample the logged data, thereby increasing exploration and improving subsequent learning.
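
A minimal sketch of one way such resampling could be realized, assuming examples are re-drawn with probability inversely proportional to the fitted imitation policy's propensity for the logged action, so that over-represented actions are down-weighted; this particular weighting scheme is an assumption for illustration.

```python
import numpy as np

def iml_resample(X, a, r, imitation_pi, rng=None):
    """Resample logged data with weights 1 / pi_hat(a_i|x_i) to flatten action coverage."""
    rng = rng or np.random.default_rng(0)
    weights = 1.0 / imitation_pi          # imitation_pi[i] = pi_hat(a_i | x_i)
    weights /= weights.sum()
    idx = rng.choice(len(a), size=len(a), replace=True, p=weights)
    return X[idx], a[idx], r[idx]
```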

6. Practical Implications and Deployment Considerations

  • Variance Control: Explicit regularization by KL-divergence (IML) is effective at controlling the variance of offline policy improvement, which is especially critical in high-dimensional action spaces or sparse data regimes.
  • Policy Class Selection: The IML diagnostic (high perplexity) provides a principled way to assess whether the chosen policy parameterization is adequate or confounded, guiding both model selection and safe policy deployment.
  • Handling Missing Data: The explicit connection between cross-entropy loss and the IPWE surrogate allows practitioners to deploy offline imitation learning without strict logging requirements, making the approach robust to operational and engineering challenges.
  • Future Optimization: Weight clipping, reward-weighted losses, and IML-resampling are recommended when deploying in environments with heavy-tailed propensities or suspected confounding (a minimal clipping sketch follows).
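
The weight clipping mentioned above amounts to a one-line change to the IPWE-style estimate; the threshold value here is a hypothetical tuning parameter.

```python
import numpy as np

def clipped_ipwe(pi, mu, r, clip=20.0):
    """IPWE with importance weights truncated at `clip` to tame heavy tails (adds bias)."""
    w = np.minimum(pi / mu, clip)
    return np.mean((w - 1.0) * r)
```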

7. Summary Table: Key Techniques and Formulae

| Technique / Concept | Mathematical Expression | Role / Purpose |
| --- | --- | --- |
| IPWE | \frac{1}{n} \sum_i (w_i - 1) r_i | Unbiased policy value estimation |
| Weight clipping, PIL | see PIL expressions above | Variance reduction, lower-bounding |
| Reward-weighted cross-entropy (CE) | -\frac{1}{n} \sum_i r_i \log \pi(a_i|x_i) | Surrogate for missing propensities |
| IML regularization | -\frac{1}{n} \sum_i \log(\pi(a_i|x_i)/\mu(a_i|x_i)), approximates weight variance | Variance control, misspecification diagnosis |
| Total objective | \max_\pi\ \mathrm{PIL}(r,\pi) - \epsilon\, \mathrm{IML}(\pi) | Combines policy improvement and variance control |
| Policy diagnosability via IML | High IML loss ⇔ confounding or misspecification | Data/model quality assessment |
| Greedy policy update | Gradient of the PIL-IML objective | Locally equivalent to natural gradient |

This framework—anchored in convex surrogates for high-variance estimators and regularization by policy imitation—provides both sound theoretical guarantees and empirical evidence for its effectiveness in offline imitation learning for contextual bandits. The diagnostic properties of IML loss, adaptability to incomplete logs, and robust empirical performance across datasets establish it as a foundational approach for real-world, reliable offline policy learning.
