Inverse Batched Contextual Bandit (IBCB)

Updated 29 March 2026

IBCB is a framework that infers evolving decision-maker beliefs and rewards from a single offline log of context–action pairs, capturing non-stationary policy evolution.
It employs Bayesian techniques with both parametric (linear–Gaussian) and nonparametric (Gaussian Process) models to reconstruct dynamic belief trajectories.
Empirical evaluations demonstrate its effectiveness over traditional IRL benchmarks, especially in applications like clinical audits and policy transparency.

The Inverse Batched Contextual Bandit (IBCB) framework addresses the problem of inferring the evolving behavioral mechanisms of decision-makers from a single offline log of contextual bandit interactions. Unlike classical imitation learning or inverse reinforcement learning (IRL), which presuppose that observed behavior emanates from a stationary expert, IBCB targets settings—such as recommender systems and clinical practice audits—where the agent's policy and internal knowledge exhibit non-stationary evolution over time. By leveraging batched logs of context–action pairs, IBCB seeks to reconstruct both the underlying reward function and the trajectory of belief states that best explain the observed policy evolution, enabling interpretable and temporally resolved insight into strategy adaptation (Hüyük et al., 2021).

1. Problem Setting and Formalization

The IBCB problem is instantiated as follows. Let $X$ denote the context space and $A$ the finite action space. At each timestep $t=1,\dots,T$, a decision-maker observes a context $x_t\in X$ and selects an action $a_t\in A$, possibly based on context–action feature vectors $x_t[a]\in\mathbb{R}^k$. The environment is parameterized by $\rho_\mathrm{env}\in P$, such that given $(x_t,a_t)$, the reward $r_t$ is sampled as $r_t\sim \mathcal{R}_{\rho_\mathrm{env}}(x_t,a_t)$, where $\mathcal{R}_\rho$ is a known reward distribution family (e.g., Gaussian with mean $\langle\rho,x[a]\rangle$ and fixed variance $\sigma^2$).

The inverse learner observes neither the rewards nor the agent's belief state. At each $t$, the agent maintains a belief $\beta_t\in B$ (e.g., a posterior over reward parameters) and selects actions according to

$\pi_{\beta_t}(x_t)[a] = \mathbb{E}_{\rho\sim\mathcal{P}_{\beta_t}}\left[\pi^*_\rho(x_t)[a]\right],$

where $\pi^*_\rho$ is the greedy or soft-optimal policy for a given $\rho$.
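
As a minimal sketch of how this expectation could be approximated, the snippet below averages a soft-optimal policy over Monte Carlo draws of $\rho$. It assumes a Gaussian belief with mean `mu` and covariance `Sigma`, a softmax form for $\pi^*_\rho$ with inverse temperature `alpha`, and linear expected rewards $\langle\rho,x[a]\rangle$. The function name, its defaults, and the example inputs are illustrative and not taken from the source.

```python
import numpy as np

def belief_policy(mu, Sigma, features, alpha=1.0, n_samples=2000, seed=0):
    """Monte Carlo estimate of pi_beta(x)[a] for a Gaussian belief N(mu, Sigma).

    features: array of shape (n_actions, k), one row x[a] per action.
    Assumption (not from the source): softmax policy over linear rewards <rho, x[a]>.
    """
    rng = np.random.default_rng(seed)
    rhos = rng.multivariate_normal(mu, Sigma, size=n_samples)  # (n_samples, k)
    logits = alpha * rhos @ features.T                         # (n_samples, n_actions)
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)                  # pi*_rho(x) for each draw
    return probs.mean(axis=0)                                  # average over the belief

# Illustrative use: two actions with one-hot features and a belief favoring action 0.
pi = belief_policy(np.array([1.0, 0.0]), 0.1 * np.eye(2), np.eye(2))
```

Because each draw of $\rho$ contributes a full probability vector, the average is itself a valid distribution over actions.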

After action selection and (unobserved) reward, the belief updates as $\beta_{t+1}\sim f(\beta_t, x_t, a_t, r_t)$. In the offline (batched) inverse setting, access is limited to a single log $\mathcal{D} = \{(x_t, a_t)\}_{t=1}^T$, with the aim of inferring both $\rho_\mathrm{env}$ and the most likely belief trajectory $\beta_{1:T}$ that explains the observed actions, assuming a prior over $\rho_\mathrm{env}$ and the initial belief $\beta_1$ (Hüyük et al., 2021).

2. Principled Modeling of Policy Evolution

To capture non-stationarity in policy adaptation, IBCB posits that the agent's internal beliefs (and therefore policies) evolve over time. Two modeling paradigms are introduced:

A. Parametric (Bayesian) model:

The agent's beliefs comprise a family $\mathcal{P}_\beta = \mathcal{N}(\mu,\Sigma)$ over $\rho$, with $\beta_t = (\mu_t, \Sigma_t)$. Bayesian updates for linear–Gaussian reward models are employed:

\begin{align*} \mu_{t+1} &= \Sigma_{t+1}\left( \Sigma_t^{-1}\mu_t + \frac{1}{\sigma^2}\, r_t\, x_t[a_t] \right), \\ \Sigma_{t+1} &= \left(\Sigma_t^{-1} + \frac{1}{\sigma^2}\, x_t[a_t]\, x_t[a_t]^{\top}\right)^{-1}. \end{align*}

Action probabilities under a sampled $\rho_t$ are soft-optimal:

$\pi^*_{\rho_t}(x_t)[a] = \frac{\exp\left(\alpha\,\mathbb{E}[r \mid \rho_t, x_t, a]\right)}{\sum_{a'} \exp\left(\alpha\,\mathbb{E}[r \mid \rho_t, x_t, a']\right)}.$
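
The linear–Gaussian belief update can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: `update_belief` is a hypothetical name, `x_a` stands for the chosen action's feature vector $x_t[a_t]$, and explicit matrix inverses are used for readability instead of a numerically preferable solver.

```python
import numpy as np

def update_belief(mu, Sigma, x_a, r, sigma2):
    """One Bayesian update of the Gaussian belief (mu, Sigma) over rho.

    x_a: feature vector x_t[a_t] of the chosen action, shape (k,).
    r: reward r_t; sigma2: known reward noise variance sigma^2.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    # Precision grows by the rank-one term x x^T / sigma^2.
    Sigma_next = np.linalg.inv(Sigma_inv + np.outer(x_a, x_a) / sigma2)
    # The mean combines the prior (precision-weighted) with the new observation.
    mu_next = Sigma_next @ (Sigma_inv @ mu + r * x_a / sigma2)
    return mu_next, Sigma_next

# Illustrative use: standard-normal prior, one observation along the first feature.
mu1, Sigma1 = update_belief(np.zeros(2), np.eye(2), np.array([1.0, 0.0]), r=1.0, sigma2=1.0)
# mu1 is [0.5, 0.0] and Sigma1 is diag(0.5, 1.0)
```

Only the direction touched by $x_t[a_t]$ becomes more certain; the second coordinate keeps its prior variance.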

B. Nonparametric (Gaussian Process) model:

Belief trajectories $\beta_{1:T}$ are governed by a Gaussian process prior,

$\operatorname{vec}(\beta_{1:T}) \sim \mathcal{N}\left(0,\,\Sigma_T\otimes\Sigma_B\right),$

with Brownian increments enforcing temporal smoothness. At each $t$, $\rho_t\sim\mathcal{N}(\beta_t, \Sigma_P)$ and $a_t\sim\pi^*_{\rho_t}(x_t)$.
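
To make the Kronecker-structured prior concrete, the sketch below draws one belief trajectory from it. The source does not spell out $\Sigma_T$, so the code assumes the standard Brownian-motion kernel $\Sigma_T[s,t]=\min(s,t)$ on integer timesteps; both function names are hypothetical. It relies on the fact that if $Z$ has i.i.d. standard-normal entries, then the rows of $L_T Z L_B^\top$ (with Cholesky factors $L_T$, $L_B$) have covariance $\Sigma_T\otimes\Sigma_B$ when stacked.

```python
import numpy as np

def brownian_kernel(T):
    """Assumed temporal covariance: Sigma_T[s, t] = min(s, t) for s, t = 1..T."""
    t = np.arange(1, T + 1)
    return np.minimum.outer(t, t).astype(float)

def sample_belief_trajectory(T, Sigma_B, rng):
    """Draw beta_{1:T} with vec(beta_{1:T}) ~ N(0, Sigma_T kron Sigma_B).

    Returns an array of shape (T, k) whose row t is beta_t.
    """
    L_T = np.linalg.cholesky(brownian_kernel(T))   # temporal factor
    L_B = np.linalg.cholesky(Sigma_B)              # belief-dimension factor
    Z = rng.standard_normal((T, Sigma_B.shape[0])) # i.i.d. N(0, 1) entries
    return L_T @ Z @ L_B.T

# Illustrative use: a 50-step trajectory of 3-dimensional beliefs.
traj = sample_belief_trajectory(50, np.eye(3), np.random.default_rng(0))
```

Under this kernel, consecutive beliefs differ by independent Gaussian increments, which is what keeps nearby timesteps similar.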

Both frameworks describe the likelihood of the progression of hidden states and observed actions, facilitating inference over both reward parameters and behavioral trajectories.

3. Algorithms: Bayesian Inference in IBCB

IBCB employs Bayesian inference strategies via Gibbs-style samplers to contend with latent variables (rewards, beliefs, parameters):

A. Parametric Bayesian ICB:

An expectation–maximization (EM) procedure iterates between:

- E-step: Drawing samples $\{r_{1:T}, \rho_{1:T}\}$ from the posterior conditional on the current parameters. Lemma 1 provides an exact sampler for $r_{1:T}$ given $\rho_{1:T}$ under the linear–Gaussian model; Metropolis–Hastings is used to sample $\rho_t$.
- M-step: Maximizing the expected log-joint likelihood $\mathbb{E}[\log P(r^{(i)},\rho^{(i)},\mathcal{D}\mid\rho_\mathrm{env},\beta_1)]$ over the drawn samples $i$ to update $(\rho_\mathrm{env},\beta_1)$.

B. Nonparametric Bayesian ICB:

A Gibbs sampler alternates:

- Sampling $\beta_{1:T}$ from the GP posterior (Lemma 2).
- Sampling $\rho_t$ for each $t$ via Metropolis–Hastings.

Samples of $\beta_{1:T}$ constitute the inferred non-stationary preference (belief) trajectories (Hüyük et al., 2021).
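
The Metropolis–Hastings update for a single $\rho_t$ might look like the sketch below, written for the nonparametric model where $\rho_t\sim\mathcal{N}(\beta_t,\Sigma_P)$ and $a_t$ follows a softmax policy. The source does not specify the proposal distribution, so a symmetric Gaussian random walk with scale `step` is assumed here. The function names, the linear reward $\langle\rho,x[a]\rangle$, and the example inputs are likewise illustrative.

```python
import numpy as np

def log_target(rho, beta_t, Sigma_P_inv, features, a_t, alpha):
    """Unnormalized log posterior of rho_t given beta_t and the observed action a_t."""
    diff = rho - beta_t
    log_prior = -0.5 * diff @ Sigma_P_inv @ diff          # N(beta_t, Sigma_P) up to a constant
    logits = alpha * features @ rho                       # one logit per action
    log_lik = logits[a_t] - np.logaddexp.reduce(logits)   # log softmax at a_t
    return log_prior + log_lik

def mh_step(rho, beta_t, Sigma_P_inv, features, a_t, alpha, step, rng):
    """One random-walk Metropolis-Hastings step (assumed symmetric proposal)."""
    proposal = rho + step * rng.standard_normal(rho.shape)
    log_ratio = (log_target(proposal, beta_t, Sigma_P_inv, features, a_t, alpha)
                 - log_target(rho, beta_t, Sigma_P_inv, features, a_t, alpha))
    if np.log(rng.uniform()) < log_ratio:
        return proposal, True    # accept
    return rho, False            # reject and keep the current state

# Illustrative use: a short chain for one timestep with two one-hot actions.
rng = np.random.default_rng(0)
rho = np.zeros(2)
chain = []
for _ in range(500):
    rho, accepted = mh_step(rho, np.zeros(2), np.eye(2), np.eye(2),
                            a_t=0, alpha=1.0, step=0.5, rng=rng)
    chain.append(rho)
chain = np.array(chain)
```

Because the proposal is symmetric, the acceptance ratio reduces to the ratio of target densities, so no proposal correction term is needed.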

4. Theoretical Properties

IBCB algorithms leverage two key conditional Gaussian lemmas:

- Lemma 1: For fixed parameters, rewards $r_{1:T}$ are conditionally Gaussian.
- Lemma 2: Under a GP prior on beliefs, $\mathrm{vec}(\beta_{1:T})$ conditional on parameters is also Gaussian.

These closed-form relations enable efficient exact sampling steps within the inference algorithms. Complete consistency and convergence rates are not established and are noted as future work directions. The framework is agnostic to the form of the exploration/exploitation algorithm used by the observed agent, provided it conforms to the structured bandit model above.
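
Conditional Gaussianity of this kind typically follows from the standard Gaussian conditioning identity. As a generic reminder, and not a restatement of either lemma (the symbols $u$, $v$, $m$, $S$ are introduced here only for illustration), if two vectors are jointly Gaussian then one conditioned on the other is again Gaussian with closed-form moments:

$\begin{pmatrix} u \\ v \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} m_u \\ m_v \end{pmatrix}, \begin{pmatrix} S_{uu} & S_{uv} \\ S_{vu} & S_{vv} \end{pmatrix}\right) \;\Longrightarrow\; u \mid v \sim \mathcal{N}\left(m_u + S_{uv}S_{vv}^{-1}(v - m_v),\; S_{uu} - S_{uv}S_{vv}^{-1}S_{vu}\right).$

An exact sampling step then amounts to computing these two moments and drawing from the resulting Gaussian.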

5. Empirical Evaluation

Evaluation of IBCB spans both real-world and synthetic agent environments:

Data Sources:
- Historical U.S. OPTN registry logs for liver transplants, with contextual features per patient and selected/transplanted actions.
- Semi-synthetic datasets in which actions are generated by simulated bandit agents (stationary, Thompson-sampling, optimistic, greedy, step-change, linear drift, and regressing behaviors).
Benchmarks:
- Standard Bayesian IRL (B-IRL; assumes stationarity)
- $M$-fold IRL (interval-wise estimation)
- CP-IRL (change-point segmentation)
- I-SPI (inverse soft-policy improvement)
- T-REX (rank-based IRL)
Metrics:
- Belief-error: $\frac{1}{T}\sum_t \|\mathbb{E}_{\rho\sim\mathcal{P}_{\beta_t}}[\rho] - \mathbb{E}_{\rho\sim\mathcal{P}_{\hat{\beta}_t}}[\rho]\|_1$
- Reward-error: $\|\rho_\mathrm{env} - \hat{\rho}_\mathrm{env}\|_1$
- Action-matching: KL divergence between true and estimated $\pi_t$
Key Findings:
- Bayesian ICB achieves leading belief- and reward-recovery accuracy when the agent's evolution is well-approximated by Bayesian updating (stationary, Thompson-sampling, optimistic, or greedy agents).
- The nonparametric GP ICB excels in settings with more general and nonlinear behavioral drift (step-change, linear drift, and regressing agents).
- Benchmarks that segment time or assume stationarity underperform due to their inability to share information across timesteps.
- Real-world interpretability: In OPTN liver allocation, the time series inferred by the nonparametric Bayesian ICB (NB-ICB) mirrors major policy interventions, e.g., shifts in the weight of INR/creatinine after the 2002 adoption of the MELD score and after the exception point caps in 2015.
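
The three metrics listed above can be computed as in the following sketch. The source does not state the direction of the KL divergence or how it is aggregated over time, so the code assumes $\mathrm{KL}(\pi_t\,\|\,\hat{\pi}_t)$ averaged over $t$. All function names and the small clipping constant `eps` are hypothetical choices.

```python
import numpy as np

def belief_error(true_means, est_means):
    """Average over t of the L1 distance between belief means; inputs have shape (T, k)."""
    return np.abs(true_means - est_means).sum(axis=1).mean()

def reward_error(rho_env, rho_hat):
    """L1 distance between true and estimated environment parameters."""
    return np.abs(rho_env - rho_hat).sum()

def action_matching_kl(true_policies, est_policies, eps=1e-12):
    """Mean over t of KL(pi_t || pi_hat_t); inputs have shape (T, n_actions).

    Assumption: this KL direction and the time average are not given in the source.
    """
    p = np.clip(true_policies, eps, 1.0)   # avoid log(0)
    q = np.clip(est_policies, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=1).mean()

# Illustrative use with toy arrays.
b_err = belief_error(np.array([[1.0, 2.0], [3.0, 4.0]]),
                     np.array([[0.0, 2.0], [3.0, 2.0]]))   # (1 + 2) / 2 = 1.5
r_err = reward_error(np.array([1.0, -1.0]), np.zeros(2))   # 2.0
```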

6. Significance and Implications

IBCB generalizes the classical IRL and non-contextual inverse bandit paradigms by:

- Permitting inference solely from a single offline interaction log, with no active or online experimentation.
- Recovering the full trajectory of belief or reward priorities, elucidating non-stationary policy adaptation.
- Achieving data efficiency via time-coupled priors, improving generalization even with limited data per epoch.
- Producing interpretable, temporally resolved weights suitable for high-stakes audit and policy analysis applications.

A plausible implication is that IBCB can support forensic auditing and policy transparency in complex environments where decision strategies evolve dynamically and only historical data are available. This includes, for example, the auditing of medical decision policies as they adapt to changing guidelines and evidence, where direct reward observations and online intervention are infeasible.

Unlike classical IRL, which presupposes stationary experts and often relies on access to expert trajectories with known reward outcomes, IBCB operates under the constraint of bandit feedback (contexts and actions only) while accounting for non-stationary behavioral evolution. Furthermore, IBCB's joint inference of both global reward parameters and the non-stationary trajectory of beliefs is analytically distinct from myopic, piecewise, or purely stationary approaches (e.g., B-IRL, $M$-fold IRL). The explicit modeling and recovery of evolving preferences, especially using joint Bayesian inference with parametric or nonparametric priors, distinguishes IBCB conceptually and methodologically from prior offline IRL and inverse bandit approaches (Hüyük et al., 2021).

References

Hüyük et al. (2021). Inverse Contextual Bandits: Learning How Behavior Evolves over Time.
