
Inverse Batched Contextual Bandit (IBCB)

Updated 29 March 2026
  • IBCB is a framework that infers evolving decision-maker beliefs and rewards from a single offline log of context–action pairs, capturing non-stationary policy evolution.
  • It employs Bayesian techniques with both parametric (linear–Gaussian) and nonparametric (Gaussian Process) models to reconstruct dynamic belief trajectories.
  • Empirical evaluations demonstrate its effectiveness over traditional IRL benchmarks, especially in applications like clinical audits and policy transparency.

The Inverse Batched Contextual Bandit (IBCB) framework addresses the problem of inferring the evolving behavioral mechanisms of decision-makers from a single offline log of contextual bandit interactions. Unlike classical imitation learning or inverse reinforcement learning (IRL), which presuppose that observed behavior emanates from a stationary expert, IBCB targets settings—such as recommender systems and clinical practice audits—where the agent's policy and internal knowledge exhibit non-stationary evolution over time. By leveraging batched logs of context–action pairs, IBCB seeks to reconstruct both the underlying reward function and the trajectory of belief states that best explain the observed policy evolution, enabling interpretable and temporally resolved insight into strategy adaptation (Hüyük et al., 2021).

1. Problem Setting and Formalization

The IBCB problem is instantiated as follows. Let $X$ denote the context space and $A$ the finite action space. At each timestep $t=1,\dots,T$, a decision-maker observes a context $x_t\in X$ and selects an action $a_t\in A$, possibly based on context–action feature vectors $x_t[a]\in\mathbb{R}^k$. The environment is parameterized by $\rho_\mathrm{env}\in P$, such that given $(x_t,a_t)$, the reward $r_t$ is sampled as $r_t\sim \mathcal{R}_{\rho_\mathrm{env}}(x_t,a_t)$, where $\mathcal{R}_\rho$ is a known reward distribution family (e.g., Gaussian with mean $\langle\rho,x[a]\rangle$ and fixed variance $\sigma^2$).

The learner does not observe rewards or the current belief state. At each $t$, the agent maintains a belief $\beta_t\in B$ (e.g., a posterior over reward parameters) and selects actions according to

$$\pi_{\beta_t}(x_t)[a] = \mathbb{E}_{\rho\sim\mathcal{P}_{\beta_t}}\left[\pi^*_\rho(x_t)[a]\right],$$

where $\pi^*_\rho$ is the greedy or soft-optimal policy for a given $\rho$.
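The belief-averaged policy can be approximated by Monte Carlo, as in the minimal sketch below. It assumes a Gaussian belief over the reward parameters, a linear expected reward (the inner product of the parameter vector with each action's feature vector), and the greedy choice for the per-parameter optimal policy. The function name `belief_policy`, the sample count, and the example features are illustrative and do not come from the source.

```python
import numpy as np

def belief_policy(mu, Sigma, x, n_samples=5000, seed=0):
    """Monte Carlo estimate of E_{rho ~ N(mu, Sigma)}[pi*_rho(x)[a]].

    Assumption: pi*_rho is the greedy policy under a linear reward <rho, x[a]>.
    x has shape (n_actions, k), one feature vector per action.
    """
    rng = np.random.default_rng(seed)
    rhos = rng.multivariate_normal(mu, Sigma, size=n_samples)  # (n_samples, k)
    expected_rewards = rhos @ x.T                              # (n_samples, n_actions)
    greedy_actions = expected_rewards.argmax(axis=1)           # best action per sample
    counts = np.bincount(greedy_actions, minlength=x.shape[0])
    return counts / n_samples                                  # empirical action frequencies

# Hypothetical example: three actions with two-dimensional features.
x = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
probs = belief_policy(mu=np.array([0.5, 0.0]), Sigma=np.eye(2), x=x)
print(probs)
```

Because the belief mean favors the first feature, the first action receives the largest probability, while the remaining mass reflects the uncertainty encoded in the belief covariance.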

After action selection and (unobserved) reward, the belief updates as $\beta_{t+1}\sim f(\beta_t, x_t, a_t, r_t)$. In the offline (batched) inverse setting, access is limited to a single log $\mathcal{D} = \{(x_t, a_t)\}_{t=1}^T$, with the aim to infer both $\rho_\mathrm{env}$ and the most likely belief trajectory $\beta_{1:T}$ that explains the observed actions, assuming a prior over $\rho_\mathrm{env}$ and initial $\beta_1$ (Hüyük et al., 2021).

2. Principled Modeling of Policy Evolution

To capture non-stationarity in policy adaptation, IBCB posits that the agent's internal beliefs (and therefore policies) evolve over time. Two modeling paradigms are introduced:

A. Parametric (Bayesian) model:

The agent's beliefs comprise a family $\mathcal{P}_\beta = \mathcal{N}(\mu,\Sigma)$ over $\rho$, with $\beta_t = (\mu_t, \Sigma_t)$. Bayesian updates for linear–Gaussian reward models are employed:

\begin{align*}
\mu_{t+1} &= \Sigma_{t+1}\left( \Sigma_t^{-1}\mu_t + \frac{1}{\sigma^2}\, r_t\, x_t[a_t] \right), \\
\Sigma_{t+1} &= \left(\Sigma_t^{-1} + \frac{1}{\sigma^2}\, x_t[a_t]\, x_t[a_t]^{\top}\right)^{-1}.
\end{align*}

Action probabilities under a sampled $\rho_t$ are soft-optimal:

$$\pi^*_{\rho_t}(x_t)[a] = \frac{\exp\left(\alpha\,\mathbb{E}[r \mid \rho_t, x_t, a]\right)}{\sum_{a'} \exp\left(\alpha\,\mathbb{E}[r \mid \rho_t, x_t, a']\right)}.$$
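To make the parametric model concrete, the sketch below implements the linear–Gaussian belief update and the soft-optimal action rule, then simulates an agent to produce a log of context–action pairs. All numerical settings (dimensions, noise variance, the value of alpha, the environment parameter) and the names `update_belief` and `soft_optimal_policy` are assumptions chosen for illustration.

```python
import numpy as np

def update_belief(mu, Sigma, x_a, r, sigma2):
    """One conjugate update of the Gaussian belief (mu, Sigma) after reward r
    for the chosen action's feature vector x_a."""
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_next = np.linalg.inv(Sigma_inv + np.outer(x_a, x_a) / sigma2)
    mu_next = Sigma_next @ (Sigma_inv @ mu + r * x_a / sigma2)
    return mu_next, Sigma_next

def soft_optimal_policy(rho, x, alpha):
    """Softmax over expected rewards <rho, x[a]> scaled by alpha."""
    logits = alpha * (x @ rho)
    logits = logits - logits.max()        # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Sanity check of a single update with hand-computable values.
mu_check, Sigma_check = update_belief(
    np.zeros(2), np.eye(2), np.array([1.0, 0.0]), r=1.0, sigma2=1.0)

# Simulated agent (hypothetical settings): rewards update the belief but are not logged.
rng = np.random.default_rng(0)
k, n_actions, T, sigma2, alpha = 2, 3, 200, 0.25, 5.0
rho_env = np.array([1.0, -0.5])
mu, Sigma = np.zeros(k), np.eye(k)
log = []
for t in range(T):
    x = rng.normal(size=(n_actions, k))                  # features x_t[a]
    rho_t = rng.multivariate_normal(mu, Sigma)           # sample from the current belief
    a = rng.choice(n_actions, p=soft_optimal_policy(rho_t, x, alpha))
    r = x[a] @ rho_env + rng.normal(scale=np.sqrt(sigma2))
    log.append((x, a))                                   # the log holds contexts and actions only
    mu, Sigma = update_belief(mu, Sigma, x[a], r, sigma2)
```

The resulting `log` has the same shape as the offline data available in the inverse problem: the rewards and the belief sequence that generated the actions are discarded.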

B. Nonparametric (Gaussian Process) model:

Belief trajectories $\beta_{1:T}$ are governed by a Gaussian process prior,

$$\operatorname{vec}(\beta_{1:T}) \sim \mathcal{N}\left(0,\,\Sigma_T\otimes\Sigma_B\right),$$

with Brownian increments enforcing temporal smoothness. At each $t$, $\rho_t\sim\mathcal{N}(\beta_t, \Sigma_P)$ and $a_t\sim\pi^*_{\rho_t}(x_t)$.
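A prior draw from this model can be sketched as follows. The temporal covariance is taken to be the standard Brownian-motion kernel, with entry (s, t) equal to min(s, t); that specific choice, along with the values of `Sigma_B` and `Sigma_P`, is an assumption made for illustration rather than a detail confirmed by the source.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 50, 2

# Assumed temporal covariance: Brownian-motion kernel min(s, t) on t = 1..T.
times = np.arange(1, T + 1)
Sigma_T = np.minimum.outer(times, times).astype(float)
Sigma_B = 0.05 * np.eye(k)   # hypothetical covariance across belief dimensions
Sigma_P = 0.01 * np.eye(k)   # hypothetical spread of rho_t around beta_t

# Sample vec(beta) ~ N(0, Sigma_T kron Sigma_B) without forming the Kronecker product:
# if Z has i.i.d. standard normal entries, L_B Z L_T^T has the required covariance.
L_T = np.linalg.cholesky(Sigma_T)
L_B = np.linalg.cholesky(Sigma_B)
Z = rng.standard_normal((k, T))
beta = (L_B @ Z @ L_T.T).T                                  # shape (T, k), row t is beta_t

# Per-step reward parameters drawn around the belief trajectory.
rho = beta + rng.multivariate_normal(np.zeros(k), Sigma_P, size=T)
```

Under the min(s, t) kernel the Cholesky factor of `Sigma_T` is a lower-triangular matrix of ones, so each `beta[t]` is a cumulative sum of independent Gaussian increments, which is the random-walk behavior that Brownian increments imply.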

Both models define a joint likelihood over the hidden state sequence and the observed actions, which makes it possible to infer reward parameters and behavioral trajectories together.

3. Algorithms: Bayesian Inference in IBCB

IBCB employs Bayesian inference strategies via Gibbs-style samplers to contend with latent variables (rewards, beliefs, parameters):

A. Parametric Bayesian ICB:

An expectation–maximization (EM) procedure iterates between:

  • E-step: Drawing samples $\{r_{1:T}, \rho_{1:T}\}$ from the posterior conditional on current parameters. Lemma 1 provides an exact sampler for $r_{1:T}$ given $\rho_{1:T}$ under the linear–Gaussian model; Metropolis–Hastings is used to sample $\rho_t$.
  • M-step: Maximizing expected log-joint likelihood $\mathbb{E}[\log P(r^{(i)},\rho^{(i)},\mathcal{D}\mid\rho_\mathrm{env},\beta_1)]$ for updates to $(\rho_\mathrm{env},\beta_1)$.

B. Nonparametric Bayesian ICB:

A Gibbs sampler alternates:

  1. Sampling $\beta_{1:T}$ from the GP posterior (Lemma 2).
  2. Sampling $\rho_t$ for each $t$ via Metropolis–Hastings.

Samples of $\beta_{1:T}$ constitute the inferred non-stationary preference (belief) trajectories (Hüyük et al., 2021).
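The Metropolis–Hastings step for a single reward parameter can be sketched as below. It targets a density proportional to a Gaussian prior centered on the current belief, multiplied by the soft-optimal probability of the observed action. The random-walk proposal, the step size, the linear reward, and the function names are assumptions for illustration; the source does not specify the proposal distribution.

```python
import numpy as np

def log_target(rho, beta_t, Sigma_P_inv, x, a, alpha):
    """Unnormalized log density: log N(rho; beta_t, Sigma_P) + log softmax(alpha * x @ rho)[a]."""
    diff = rho - beta_t
    log_prior = -0.5 * diff @ Sigma_P_inv @ diff
    logits = alpha * (x @ rho)
    shifted = logits - logits.max()                    # stable log-sum-exp
    log_lik = shifted[a] - np.log(np.exp(shifted).sum())
    return log_prior + log_lik

def mh_sample_rho(beta_t, Sigma_P, x, a, alpha, n_steps=4000, step=0.8, seed=0):
    """Random-walk Metropolis-Hastings over rho_t (the proposal is an assumed choice)."""
    rng = np.random.default_rng(seed)
    Sigma_P_inv = np.linalg.inv(Sigma_P)
    rho = np.array(beta_t, dtype=float)
    current = log_target(rho, beta_t, Sigma_P_inv, x, a, alpha)
    samples = np.empty((n_steps, rho.size))
    accepted = 0
    for i in range(n_steps):
        proposal = rho + step * rng.standard_normal(rho.size)
        candidate = log_target(proposal, beta_t, Sigma_P_inv, x, a, alpha)
        # A symmetric proposal means the acceptance ratio is the target ratio.
        if np.log(rng.uniform()) < candidate - current:
            rho, current = proposal, candidate
            accepted += 1
        samples[i] = rho
    return samples, accepted / n_steps

# Hypothetical example: two actions, and the agent was observed choosing action 0.
x = np.array([[1.0, 0.0], [0.0, 1.0]])
samples, accept_rate = mh_sample_rho(np.zeros(2), np.eye(2), x, a=0, alpha=3.0)
posterior_mean = samples[1000:].mean(axis=0)           # discard burn-in
```

Observing action 0 shifts the sampled parameters toward vectors that score the first action above the second, so the first component of `posterior_mean` ends up larger than the second.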

4. Theoretical Properties

IBCB algorithms leverage two key conditional Gaussian lemmas:

  • For fixed parameters, rewards $r_{1:T}$ are conditionally Gaussian.
  • Under a GP prior on beliefs, $\mathrm{vec}(\beta_{1:T})$ conditional on parameters is also Gaussian.

These closed-form relations enable efficient exact sampling steps within the inference algorithms. Complete consistency and convergence rates are not established and are noted as future work directions. The framework is agnostic to the form of the exploration/exploitation algorithm used by the observed agent, provided it conforms to the structured bandit model above.
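For the nonparametric model, the conditional Gaussian over the belief trajectory follows from standard Gaussian conditioning, as the sketch below illustrates. It assumes the structure stated earlier: a zero-mean prior with covariance given by the Kronecker product of a temporal and a belief covariance, and per-step parameters equal to the belief plus independent Gaussian noise. This is a generic conjugate computation and may differ in form from the lemma in the cited paper; the function name is hypothetical.

```python
import numpy as np

def gp_belief_posterior(rhos, Sigma_T, Sigma_B, Sigma_P):
    """Posterior of vec(beta_{1:T}) given rho_{1:T}.

    Prior: vec(beta) ~ N(0, K), K = kron(Sigma_T, Sigma_B).
    Observation: vec(rho) = vec(beta) + noise, noise ~ N(0, kron(I_T, Sigma_P)).
    rhos has shape (T, k); flattening row by row stacks rho_1, ..., rho_T.
    """
    T, k = rhos.shape
    K = np.kron(Sigma_T, Sigma_B)
    noise = np.kron(np.eye(T), Sigma_P)
    # gain = K (K + noise)^{-1}; both matrices are symmetric, so transpose the solve.
    gain = np.linalg.solve(K + noise, K).T
    mean = gain @ rhos.reshape(-1)
    cov = K - gain @ K
    cov = 0.5 * (cov + cov.T)              # remove floating-point asymmetry
    return mean.reshape(T, k), cov

# One-dimensional sanity check: prior variance 1, noise variance 1, observation 2.
m1, c1 = gp_belief_posterior(np.array([[2.0]]), np.eye(1), np.eye(1), np.eye(1))

# Hypothetical example with a Brownian-style temporal kernel and an exact posterior draw.
rng = np.random.default_rng(0)
T, k = 5, 2
times = np.arange(1, T + 1)
Sigma_T = np.minimum.outer(times, times).astype(float)
rhos = rng.normal(size=(T, k))
mean, cov = gp_belief_posterior(rhos, Sigma_T, 0.5 * np.eye(k), 0.1 * np.eye(k))
beta_sample = rng.multivariate_normal(mean.reshape(-1), cov).reshape(T, k)
```

In the one-dimensional check the posterior mean is halfway between the prior mean 0 and the observation 2, and the variance is halved, as expected when prior and noise variances are equal.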

5. Empirical Evaluation

Evaluation of IBCB spans both real-world and synthetic agent environments:

  • Data Sources:
    • Historical U.S. OPTN registry logs for liver transplants, with contextual features per patient and selected/transplanted actions.
    • Semi-synthetic datasets in which actions are generated by simulated bandit agents (stationary, Thompson-sampling, optimistic, greedy, step-change, linear drift, and regressing behaviors).
  • Benchmarks:
    • Standard Bayesian IRL (B-IRL; assumes stationarity)
    • $M$-fold IRL (interval-wise estimation)
    • CP-IRL (change-point segmentation)
    • I-SPI (inverse soft-policy improvement)
    • T-REX (rank-based IRL)
  • Metrics:
    • Belief-error: $\frac{1}{T}\sum_t \left\|\mathbb{E}_{\rho\sim\mathcal{P}_{\beta_t}}[\rho] - \mathbb{E}_{\rho\sim\mathcal{P}_{\hat{\beta}_t}}[\rho]\right\|_1$
    • Reward-error: $\|\rho_\mathrm{env} - \hat{\rho}_\mathrm{env}\|_1$
    • Action-matching: KL divergence between true and estimated $\pi_t$
  • Key Findings:
    • Bayesian ICB achieves leading belief- and reward-recovery accuracy when the agent's evolution is well-approximated by Bayesian updating (stationary, Thompson-sampling, optimistic, or greedy agents).
    • The nonparametric GP ICB excels in settings with more general and nonlinear behavioral drift (step-change, linear-drift, and regressing agents).
    • Benchmarks that segment time or assume stationarity underperform due to their inability to share information across timesteps.
    • Real-world interpretability: In OPTN liver allocation, the inferred NB-ICB time series mirrors major policy interventions, e.g., shifts in the weight of INR/creatinine post-2002 MELD-score adoption and after exception point caps in 2015.
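The three evaluation metrics listed in this section can be computed as in the following sketch. The direction of the KL divergence (true policy relative to estimated policy), the averaging over timesteps, and the clipping constant are assumptions, since the source does not pin them down; the function names are illustrative.

```python
import numpy as np

def belief_error(true_means, est_means):
    """Mean over timesteps of the L1 distance between belief means; inputs are (T, k)."""
    return np.abs(true_means - est_means).sum(axis=1).mean()

def reward_error(rho_env, rho_env_hat):
    """L1 distance between the true and estimated environment parameters."""
    return np.abs(rho_env - rho_env_hat).sum()

def action_kl(pi_true, pi_est, eps=1e-12):
    """Average over timesteps of KL(pi_true || pi_est); inputs are (T, n_actions).

    Probabilities are clipped at eps to avoid log(0); this is an assumed convention.
    """
    p = np.clip(pi_true, eps, 1.0)
    q = np.clip(pi_est, eps, 1.0)
    return (p * np.log(p / q)).sum(axis=1).mean()

# Small hand-checkable example with made-up numbers.
be = belief_error(np.array([[0.0, 0.0], [1.0, 1.0]]),
                  np.array([[1.0, 0.0], [1.0, 3.0]]))
re = reward_error(np.array([1.0, -0.5]), np.array([0.5, 0.5]))
kl = action_kl(np.array([[0.5, 0.5]]), np.array([[0.25, 0.75]]))
```

In this example the per-step L1 distances are 1 and 2, giving a belief-error of 1.5, and the reward-error is 0.5 + 1.0 = 1.5.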

6. Significance and Implications

IBCB generalizes the classical IRL and non-contextual inverse bandit paradigms by:

  • Permitting inference solely from a single offline interaction log, with no active or online experimentation.
  • Recovering the full trajectory of belief or reward priorities, elucidating non-stationary policy adaptation.
  • Achieving data efficiency via time-coupled priors, improving generalization even with limited data per epoch.
  • Producing interpretable, temporally resolved weights suitable for high-stakes audit and policy analysis applications.

A plausible implication is that IBCB can support forensic auditing and policy transparency in complex environments where decision strategies evolve dynamically and only historical data are available. This includes, for example, the auditing of medical decision policies as they adapt to changing guidelines and evidence, where direct reward observations and online intervention are infeasible.

Unlike classical IRL, which presupposes stationary experts and often relies on access to expert trajectories with known reward outcomes, IBCB operates under the constraint of bandit feedback (contexts and actions only) while accounting for non-stationary behavioral evolution. Furthermore, IBCB's joint inference of both global reward parameters and the non-stationary trajectory of beliefs is analytically distinct from myopic, piecewise, or purely stationary approaches (e.g., B-IRL, $M$-fold IRL). The explicit modeling and recovery of evolving preferences, especially using joint Bayesian inference with parametric or nonparametric priors, distinguishes IBCB conceptually and methodologically from prior offline IRL and inverse bandit approaches (Hüyük et al., 2021).

References (1)
