Bayesian Inverse Reinforcement Learning
- Bayesian Inverse Reinforcement Learning is a framework that infers an agent’s hidden rewards and preferences by modeling uncertainties with prior distributions over rewards and policies.
- It employs sampling methods such as Metropolis–Hastings and Gibbs sampling to approximate posterior distributions, enabling effective recovery of latent reward structures.
- The approach outperforms traditional IRL methods by leveraging uncertainty, enhancing preference elicitation, and synthesizing improved policies even with sub-optimal demonstrations.
Bayesian Inverse Reinforcement Learning (IRL) frames the problem of inferring an agent’s reward (or preference) structure as a principled statistical inference problem under uncertainty. By treating the reward function, the policy, and potentially the observed reward realizations as latent variables, this approach yields a posterior distribution over preferences, policy, and reward given observed behavior, enabling both accurate preference elicitation and the synthesis of improved decision policies, even in the presence of sub-optimal demonstrations (Rothkopf et al., 2011).
1. Bayesian Model Formulation
In the Bayesian IRL framework, the environment is modeled as a known controlled Markov process (CMP), and the agent’s utility is defined as the total discounted reward $U = \sum_{t=0}^{\infty} \gamma^{t} r_t$, where each reward $r_t$ is sampled from a latent reward function $\rho$ and $\gamma \in (0, 1)$ is the discount factor.
Bayesian IRL introduces priors over both the unknown reward functions and the latent policies:
- Reward prior: $\xi(\rho)$, a distribution over candidate reward functions.
- Conditional policy prior: $\psi(\pi \mid \rho)$, a distribution over policies given a reward function.
The joint prior is defined as $\xi(\pi, \rho) = \psi(\pi \mid \rho)\,\xi(\rho)$, enabling explicit modeling of prior beliefs about agent rationality and policy stochasticity. A common example is a softmax (Boltzmann) policy, $\pi(a \mid s) \propto \exp\!\big(\eta\, Q^{*}_{\rho}(s, a)\big)$, with $Q^{*}_{\rho}$ the optimal Q-function under reward $\rho$ and a prior (e.g., Gamma) over the inverse temperature $\eta$, thus directly controlling policy optimality and stochasticity in the model.
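For concreteness, the sketch below illustrates this conditional policy prior in a tabular setting: an optimal Q-function is computed for a candidate reward by value iteration, the demonstrator acts via a softmax over it, and the inverse temperature is drawn from a Gamma prior. The transition tensor `P`, the helper names, and the hyperparameters are illustrative assumptions, not the paper’s implementation.

```python
import numpy as np

def q_value_iteration(P, rho, gamma=0.95, n_iter=500):
    """Compute Q*_rho for a tabular CMP with transition tensor P[s, a, s']."""
    Q = np.zeros(rho.shape)
    for _ in range(n_iter):
        V = Q.max(axis=1)              # greedy state values
        Q = rho + gamma * (P @ V)      # Bellman optimality backup
    return Q

def softmax_policy(Q, eta):
    """Boltzmann policy: pi(a|s) proportional to exp(eta * Q(s, a))."""
    logits = eta * Q
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Conditional policy prior: the inverse temperature eta is drawn from a Gamma
# prior, so larger eta corresponds to a more nearly optimal demonstrator.
rng = np.random.default_rng(0)
eta = rng.gamma(shape=2.0, scale=1.0)
```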
2. Posterior Inference and Likelihood Structure
Given an observed demonstration $D = (s_1, a_1, \ldots, s_T, a_T)$, Bayesian IRL computes the posterior over rewards (and implicitly policies) using Bayes’ theorem. The likelihood of the demonstration under a stationary policy $\pi$ factorises as $p(D \mid \pi) = \prod_{t=1}^{T} \pi(a_t \mid s_t)$.
The posterior probability assigned to a set $B$ of reward functions (a subset of reward-function space) is
$$\xi(B \mid D) = \frac{\int_B p(D \mid \rho)\, d\xi(\rho)}{\int p(D \mid \rho)\, d\xi(\rho)}, \qquad p(D \mid \rho) = \int p(D \mid \pi)\, d\psi(\pi \mid \rho),$$
where the inner integral marginalises over the latent policy.
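Under the softmax behavior model above, the demonstration likelihood factorises over time steps. A minimal sketch, reusing the hypothetical `softmax_policy` helper from the previous block and representing the trajectory as a list of (state, action) pairs:

```python
def log_likelihood(trajectory, Q, eta):
    """log p(D | pi) = sum_t log pi(a_t | s_t) for a stationary softmax policy."""
    pi = softmax_policy(Q, eta)
    return float(sum(np.log(pi[s, a]) for s, a in trajectory))
```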
Two model variants are described:
- The “basic” model, where only the reward function and the policy are latent.
- The “reward-augmented” model, where the reward sequence is also modeled explicitly.
Sampling-based inference is performed via Metropolis–Hastings (MH) algorithms, with proposals drawn jointly over $(\rho, \pi)$, or, for the reward-augmented model, using a two-stage Gibbs procedure that alternates between the reward function and the realized reward sequence. This joint inference directly accounts for uncertainty in both the preferences and the behavior mapping.
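As a deliberately simplified illustration of the basic-model sampler, the sketch below runs random-walk Metropolis–Hastings jointly over a tabular reward and the inverse temperature. The Gaussian reward prior, the Gamma(2, 1) prior on eta, the fixed step size, and re-solving the Q-function at every proposal are assumptions made for readability rather than the paper’s exact algorithm.

```python
def mh_sampler(trajectory, P, n_samples=2000, step=0.1, gamma=0.95, seed=0):
    """Random-walk MH over the reward table rho and inverse temperature eta."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = P.shape[0], P.shape[1]

    def log_post(rho, eta):
        # log-likelihood + log Gaussian prior on rho + log Gamma(2, 1) prior on eta
        ll = log_likelihood(trajectory, q_value_iteration(P, rho, gamma), eta)
        return ll - 0.5 * np.sum(rho ** 2) + np.log(eta) - eta

    rho, eta = rng.normal(size=(n_states, n_actions)), 1.0
    current = log_post(rho, eta)
    samples = []
    for _ in range(n_samples):
        rho_prop = rho + step * rng.normal(size=rho.shape)
        eta_prop = abs(eta + step * rng.normal())       # reflect to keep eta > 0
        proposed = log_post(rho_prop, eta_prop)
        if np.log(rng.uniform()) < proposed - current:  # MH accept/reject step
            rho, eta, current = rho_prop, eta_prop, proposed
        samples.append((rho.copy(), eta))
    return samples
```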
3. Comparison with Other IRL Methods
The Bayesian IRL approach is contrasted with several established methods:
- Linear Programming (LP) IRL: Finds a reward that maximizes the value margin between the best and second-best actions; it assumes a near-optimal demonstrator and struggles when action-value gaps are small.
- Maximum Entropy IRL: Constructs a trajectory distribution that maximizes entropy subject to feature matching, which confers minimax properties but rarely yields a policy exceeding demonstrator performance.
- Game-Theoretic (MWAL) IRL: Seeks worst-case loss bounds via an adversarial (game-theoretic) objective, which is robust but often overly conservative, yielding policies that approach but do not surpass the demonstration.
Bayesian IRL specifically provides:
- Joint inference of a distribution, not just a point estimate.
- Capacity to encode rich prior knowledge over both rewards and rationality/irregularity in action selection.
- Theoretical and empirical capacity to recover preferences and synthesize policies that outperform even sub-optimal demonstrators.
Performance is sensitive to the expressivity and selection of the priors $\xi(\rho)$ and $\psi(\pi \mid \rho)$ over rewards and policies, as well as to structural assumptions such as softmax-optimality of the demonstrator’s policy, which constitute model limitations.
4. Empirical Evaluation and Quantitative Results
Experiments in Random MDPs (with four actions) and Random Maze domains—where the demonstrator follows a softmax policy (with respect to its own unknown Q-values)—demonstrate the following:
- Policies derived from Bayesian IRL (especially with the MH sampler) frequently outperform the demonstrator’s own policy as measured by the L₁ loss (the sum of statewise value errors relative to the optimal policy under the agent’s actual reward; sketched after this list).
- The quality of the inferred preferences and of the resulting policy improves with both the length of the demonstration trajectories and the size of the state space.
- The hybrid Gibbs sampler for the reward-augmented model gives competitive but slightly higher-variance outcomes compared to the basic MH sampler.
- When compared directly against LP IRL, Policy Walk, and MWAL, the Bayesian IRL samplers provide superior preference recovery and improved policy quality, as measured by the value difference under the agent’s true reward.
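A minimal sketch of this L₁ value loss, reusing the hypothetical `q_value_iteration` helper and assuming a tabular setting in which the candidate policy can be evaluated in closed form:

```python
def l1_value_loss(rho_true, P, policy, gamma=0.95):
    """Sum over states of |V*(s) - V^pi(s)| under the agent's true reward."""
    n_states = rho_true.shape[0]
    V_star = q_value_iteration(P, rho_true, gamma).max(axis=1)

    # Evaluate the (possibly stochastic) candidate policy under the true reward.
    r_pi = (policy * rho_true).sum(axis=1)            # expected one-step reward
    P_pi = np.einsum('sap,sa->sp', P, policy)         # policy-induced transitions
    V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return float(np.abs(V_star - V_pi).sum())
```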
5. Handling Sub-Optimal Demonstrators and Preference Improvement
A critical empirical observation is that the Bayesian IRL framework accurately recovers the underlying reward function even when the demonstrator is significantly sub-optimal with respect to its own preferences. This is attributed to the posterior’s capacity to reflect persistent structure in preference-behavior mappings rather than relying solely on behavioral optimality or observed performance. As a consequence:
- The IRL-inferred reward function often enables computation of an improved (greedy) policy that outperforms the actual demonstrator relative to the agent’s latent preferences.
- This property is particularly valuable in practical or safety-critical contexts where observed demonstrations are noisy, inconsistent, or systematically sub-optimal due to limited exploration or information.
6. Implementation and Practical Considerations
Practical deployment of Bayesian IRL for preference elicitation involves:
- Explicit model specification for priors over rewards (e.g., Gaussian, spike-and-slab) and policies (parameterized, e.g., via softmax temperature).
- Design of efficient Markov chain Monte Carlo samplers (e.g., MH or hybrid Gibbs approaches) to traverse the often multi-modal and high-dimensional posterior landscape.
- Integration of policy improvement steps (policy extraction given inferred rewards; see the sketch below) for acting in accordance with elicited preferences, which is provably advantageous in the presence of sub-optimal demonstrators.
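One simple way to realize this policy-improvement step, sketched under the same assumptions as the blocks above, is to act greedily with respect to the posterior mean reward produced by the sampler; averaging greedy policies per posterior sample is an equally valid alternative.

```python
def extract_policy(samples, P, gamma=0.95):
    """Greedy policy with respect to the posterior mean reward."""
    rho_mean = np.mean([rho for rho, _ in samples], axis=0)   # posterior mean reward
    Q = q_value_iteration(P, rho_mean, gamma)
    return Q.argmax(axis=1)            # deterministic action per state
```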
The joint and hierarchical modeling capability of Bayesian IRL permits flexibility in accounting for prior domain information, agent rationality, stochasticity in decisions, and arbitrary reward structures, all while maintaining sound statistical guarantees of inference.
7. Summary and Significance
Bayesian IRL, as instantiated via the preference elicitation framework (Rothkopf et al., 2011), formulates the task of learning agent preferences as full Bayesian inference, defining a joint prior over reward functions and policies and inferring the posterior from demonstration data. This structured approach:
- Yields posterior distributions (and empirical samples) over rewards and policies.
- Enables principled and flexible incorporation of prior knowledge.
- Surpasses point-estimate methods, especially when demonstrators are imperfect.
- Supports derivation of superior policies that genuinely improve on observed behavior, backed by both theoretical guarantees and empirical results in controlled discrete domains.
These features position Bayesian IRL as a statistically robust and empirically validated approach for preference elicitation, preference-based planning, and reliable policy improvement in complex sequential environments.