Convex Inverse Reinforcement Learning (CIRL)

Updated 6 January 2026
  • Convex Inverse Reinforcement Learning is a method that frames reward inference as a convex optimization problem, ensuring global optimality and robustness.
  • It leverages MAP estimation, strong regularizers, and Gaussian process priors to overcome issues like local minima and ill-posed identifiability in traditional IRL.
  • CIRL extends to various settings including discrete, continuous, and contextual environments, offering computational tractability and provable performance guarantees.

Convex Inverse Reinforcement Learning (CIRL) refers to a class of IRL methodologies in which the reward inference problem is posed as a convex or strongly convex optimization, typically via MAP estimation, with tractable guarantees of global optimality and efficient solution methods. These approaches address fundamental shortcomings of non-convex IRL—such as sensitivity to initialization, local minima, and ill-posed identifiability—by exploiting convex structure in either the reward regularization, the policy regularizer, or the probabilistic likelihood. Modern CIRL frameworks are applicable across discrete, continuous, and contextual settings, and support extensions to constrained MDPs, approximate expert policies, nonparametric reward functions, and real-world data regimes.

1. Mathematical Formulation and Convexity Conditions

CIRL is fundamentally characterized by the formulation of reward inference as a convex optimization problem. In the finite-state setting, consider an MDP with state space S = \{s_1,\dots,s_n\}, action space A = \{a_1,\dots,a_m\}, discount factor \gamma, and an unknown reward function r : S \to \mathbb{R}. For fully observed stationary expert policies \pi^*(s) = a^*, Bellman optimality yields affine constraints:

(P_{\pi^*} - P_a)(I - \gamma P_{\pi^*})^{-1} r \geq 0,\quad \forall a \neq \pi^*(s),

where P_a are per-action transition matrices (Qiao et al., 2012). By placing a Gaussian prior r \sim \mathcal{N}(\mu_r, \Sigma_r), the negative log-posterior becomes:

\min_{r \in \mathbb{R}^n}\;\; \frac12 (r-\mu_r)^\top \Sigma_r^{-1} (r-\mu_r) \quad \text{subject to } A r \geq 0,

with A constructed from the Bellman constraints. This is a convex quadratic program (QP), ensuring global optimality and robustness to local minima.
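
As a concrete illustration, the QP above can be assembled and solved directly with CVXPY. The sketch below assumes a small tabular MDP with known per-action transition matrices and an illustrative Gaussian prior; the function and argument names are hypothetical, not taken from the cited work.

```python
import numpy as np
import cvxpy as cp

def map_reward_qp(P, pi_star, gamma, mu_r, Sigma_r):
    """Minimal sketch: MAP reward inference as a convex QP.

    P       : dict mapping action a -> (n, n) transition matrix P_a
    pi_star : length-n array of expert action indices, pi*(s)
    gamma   : discount factor
    mu_r, Sigma_r : mean (n,) and covariance (n, n) of the Gaussian reward prior
    """
    n = len(pi_star)
    # Transition matrix induced by the expert policy: row s is the row of P_{pi*(s)}.
    P_pi = np.vstack([P[pi_star[s]][s] for s in range(n)])
    M = np.linalg.inv(np.eye(n) - gamma * P_pi)          # (I - gamma P_pi)^{-1}

    Sigma_inv = np.linalg.inv(Sigma_r)
    Sigma_inv = 0.5 * (Sigma_inv + Sigma_inv.T)          # symmetrize for quad_form

    r = cp.Variable(n)
    # Bellman-optimality constraints (P_pi - P_a)(I - gamma P_pi)^{-1} r >= 0 for every action
    # (rows with a = pi*(s) reduce to the trivial constraint 0 >= 0).
    constraints = [(P_pi - P[a]) @ M @ r >= 0 for a in P]
    objective = cp.Minimize(0.5 * cp.quad_form(r - mu_r, Sigma_inv))
    cp.Problem(objective, constraints).solve()
    return r.value
```

Because the objective is a convex quadratic and every constraint is affine in r, any feasible solution returned by the solver is the global MAP estimate.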

In regularized IRL, the reward is learned such that the expert's policy \pi_E uniquely maximizes the regularized expected return:

J_\Omega(r, \pi) = \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t \big(r(s_t, a_t) - \Omega(\pi(\cdot|s_t))\big) \right],

where \Omega is any strongly convex policy regularizer (e.g., negative Shannon entropy). The convex IRL objective is:

\arg\max_r \; J_\Omega(r, \pi_E) - \max_\pi J_\Omega(r, \pi)

(Jeon et al., 2020). Strong convexity of \Omega yields uniqueness of the optimal policy and well-posedness of the IRL solution.
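
For the Shannon-entropy regularizer, this objective can be evaluated in closed form on tabular problems: the inner maximization is soft value iteration, and the expert term is a linear solve. The sketch below is illustrative only, with assumed array shapes (reward as an (n, m) table, transitions as an (m, n, n) tensor, policies as (n, m) row-stochastic arrays).

```python
import numpy as np
from scipy.special import logsumexp

def soft_optimal_value(r, P, gamma, iters=500):
    """max_pi J_Omega(r, pi) per state, for Omega = negative Shannon entropy (soft value iteration)."""
    n, m = r.shape
    V = np.zeros(n)
    for _ in range(iters):
        Q = r + gamma * np.einsum('asj,j->sa', P, V)   # Q(s, a)
        V = logsumexp(Q, axis=1)                       # soft maximum over actions
    return V

def entropy_regularized_value(r, P, pi, gamma):
    """J_Omega(r, pi) per state: discounted reward plus entropy bonus under policy pi."""
    n, _ = r.shape
    r_pi = (pi * (r - np.log(pi + 1e-12))).sum(axis=1)   # E_{a~pi}[r(s,a) - log pi(a|s)]
    P_pi = np.einsum('sa,asj->sj', pi, P)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# IRL objective for a candidate tabular reward r, expert policy pi_E, and initial
# distribution rho0 (concave in r, since the first term is linear and the second convex):
#   rho0 @ (entropy_regularized_value(r, P, pi_E, gamma) - soft_optimal_value(r, P, gamma))
```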

For demonstration-driven, non-parametric IRL, a Gaussian Process prior over reward functions extends convexity to large or infinite state spaces by encoding expert preferences as directed preference graphs, reducing inference to convex optimization over latent GP values (Qiao et al., 2012).

2. Bayesian MAP, Likelihood Structures, and Parametric Extensions

MAP estimation in CIRL leverages probabilistic action likelihoods that are log-concave in the reward or value parameters. Under a softmax policy likelihood, the negative log-likelihood for observed data D = \{(s_t,a_t)\} and Q-function parameterization Q(s,a) = \phi_Q(s,a)^\top w_Q (or reward parameterization r = \phi_R(s,a)^\top w_R) is:

J(\theta) = -\sum_{(s,a)\in D} \left[ \beta\, \phi(s,a)^\top \theta - \log \sum_{a'} e^{\beta\, \phi(s,a')^\top \theta} \right],

where \theta denotes the weight vector, with the policy softness \beta absorbed when convenient (Tossou et al., 2013, Tossou et al., 2014). J(\theta) is convex in \theta by properties of the log-sum-exp function composed with linear maps, enabling superlinear convergence with first- and second-order solvers such as L-BFGS.
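
A minimal sketch of this MAP estimation, assuming linear features, a fixed softness beta, and an optional ridge penalty; SciPy's L-BFGS-B stands in for the solvers mentioned above (finite-difference gradients suffice for illustration, though an analytic gradient would normally be supplied). All names here are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def irl_map_softmax(phi, demos, beta=1.0, lam=0.0):
    """Convex MAP estimate of theta under a softmax action likelihood.

    phi   : (n_states, n_actions, d) feature tensor phi(s, a)
    demos : iterable of observed (state, action) index pairs
    beta  : policy softness; lam : optional L2 penalty weight
    """
    _, _, d = phi.shape

    def neg_log_posterior(theta):
        logits = beta * (phi @ theta)                 # (n_states, n_actions)
        log_Z = logsumexp(logits, axis=1)             # per-state normalizer
        nll = sum(log_Z[s] - logits[s, a] for s, a in demos)
        return nll + 0.5 * lam * theta @ theta        # log-sum-exp of linear maps => convex

    res = minimize(neg_log_posterior, np.zeros(d), method='L-BFGS-B')
    return res.x
```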

Convexity is preserved under additional regularization terms (e.g., \lambda \|\theta\|_p for p \in \{1,2\}), kernelized extensions (via RKHS representations), or multi-reward/hierarchical block-sparse regularizers. Feature scaling and cross-validation of hyperparameters (temperature, penalties) are recommended for empirical robustness (Tossou et al., 2013).

3. Nonparametric and Structured Preference Integration

For infinite or large state/action spaces, CIRL employs action-indexed Gaussian processes r_{a_j}(\cdot) \sim \mathrm{GP}(0, k_{a_j}), with squared-exponential kernels and learnable hyperparameters. Expert demonstrations are encoded as preference graphs: for each state s_i, V_+ contains the chosen actions and V_- the rejected ones, and strict or equivalence edges encode:

  • u \to v: Q(s_i,u) \geq Q(s_i,v), mapped to probit likelihoods via \mathbb{P}(u \to v \mid r) = \Phi\big((Q(s_i,u)-Q(s_i,v))/(\sqrt{2}\sigma)\big);
  • u \leftrightarrow v: Q(s_i,u) \approx Q(s_i,v), penalized quadratically (Qiao et al., 2012).

The joint posterior over latent reward values aggregates all preference-induced constraints:

U(r) = \frac12 r^\top K^{-1} r - \sum_{i,\ell} \log \Phi(z_i^\ell) + \frac12 \sum_{i,k} \big[Q(s_i,u_k) - Q(s_i,v_k)\big]^2

with U(r) convex in r; global minimization yields interpolated reward predictions at unobserved states via standard GP posterior formulas.
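
A compact sketch of this preference-graph MAP problem, assuming the graph has already been flattened into index pairs over a single vector of latent reward/Q values with kernel matrix K; the function signature and noise model are illustrative assumptions, not the cited implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def preference_gp_map(K, strict_edges, equiv_edges, sigma=0.1):
    """MAP latent values under a GP prior with probit preference likelihoods.

    K            : (N, N) kernel matrix over the latent values
    strict_edges : (i, j) pairs encoding value[i] >= value[j]  (u -> v edges)
    equiv_edges  : (i, j) pairs encoding value[i] ~ value[j]   (u <-> v edges)
    """
    K_inv = np.linalg.inv(K + 1e-8 * np.eye(len(K)))

    def U(f):
        prior = 0.5 * f @ K_inv @ f
        strict = -sum(norm.logcdf((f[i] - f[j]) / (np.sqrt(2) * sigma))
                      for i, j in strict_edges)            # -log Phi terms, convex
        equiv = 0.5 * sum((f[i] - f[j]) ** 2 for i, j in equiv_edges)
        return prior + strict + equiv

    res = minimize(U, np.zeros(len(K)), method='L-BFGS-B')
    return res.x   # GP posterior formulas then extrapolate to unobserved states
```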

4. Algorithmic Implementations and Computational Complexity

CIRL supports multiple algorithmic pipelines:

  • Finite QP / CVXPY: Assemble the Bellman-affine constraint matrix A and solve a convex QP for r or a parameter vector \theta, with optional \ell_1 or \ell_2 regularization. Complexity: O(n^3) for dense A (Zhu et al., 27 Jan 2025).
  • Trajectory-based extensions: For locally optimal or noisy expert demonstrations, partition the trajectory, build local transition matrices, and introduce slack variables for the noisy constraints (see the sketch after this list). Bisection auto-tunes the sparsity hyperparameter \lambda, enforcing criticality in the regularization.
  • GP-based Newton/quasi-Newton: Convexity of the objective allows efficient global optimization with sparse-GP or subset-of-regressors approaches, scaling as O((mN)^2 K) for sparse GPs (with K \ll mN) (Qiao et al., 2012).
  • Preference Graphs: Integration of preference data scales linearly with graph-edge count and is robust to incomplete information.
  • Adversarial/regularized algorithms: When a strongly convex regularizer is present, iterative policy and reward updates minimize Bregman divergences between learned and expert visitations, with theoretical guarantees of convergence (Jeon et al., 2020).
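
The slack-variable variant referenced in the second bullet can be written as another small CVXPY program; the sketch below is schematic, with the stacked constraint matrix, prior precision, and sparsity weight passed in as assumed inputs.

```python
import numpy as np
import cvxpy as cp

def noisy_bellman_qp(A, mu_r, Sigma_inv, lam):
    """Bellman-affine QP with slack for noisy or locally optimal demonstrations.

    A         : (k, n) stacked Bellman-optimality constraints (ideally A r >= 0)
    mu_r      : (n,) prior mean; Sigma_inv : (n, n) symmetric PSD prior precision
    lam       : weight on the slack penalty (e.g., tuned by bisection)
    """
    k, n = A.shape
    r = cp.Variable(n)
    xi = cp.Variable(k, nonneg=True)       # per-constraint violation permitted by noise
    objective = cp.Minimize(0.5 * cp.quad_form(r - mu_r, Sigma_inv) + lam * cp.sum(xi))
    cp.Problem(objective, [A @ r >= -xi]).solve()
    return r.value, xi.value
```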

5. Identifiability, Generalizability, and Theoretical Guarantees

CIRL exposes identifiability up to potential shaping under entropy regularization, but in the presence of constraints or non-Shannon regularizers, identification may only be possible up to very restricted equivalence classes (Schlaginhaufen et al., 2023). The key optimality criterion for the expert occupancy measure \mu^E is:

r \in \partial f(\mu) + N_\mathcal{F}(\mu),

where N_\mathcal{F}(\mu) is the normal cone to the constraint set and f the regularizer. Entropy regularization makes \partial f analogous to the policy-logit derivation, so rewards are recoverable up to functions in the span of the transition structure (Schlaginhaufen et al., 2023). Generalizability to novel MDPs or safety budgets requires recovery of the expert reward up to a global constant offset.
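
Concretely, the potential-shaping equivalence class consists of all rewards of the form

r'(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[\psi(s')\right] - \psi(s), \qquad \psi : S \to \mathbb{R} \ \text{arbitrary},

since such shaping shifts every action's soft Q-value at a state by the same amount \psi(s), leaving the entropy-regularized optimal policy, and hence the demonstration likelihood, unchanged.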

Finite-sample guarantees provide error bounds on learned rewards and policies of order O(1/\sqrt{N}) for N trajectories and ensure efficient convergence in practice.

6. Contextual and Structured Extensions

CIRL generalizes to contextual MDPs, where a context c parametrizes both transitions and reward:

R_c^*(s) = f^*(c)^\top \phi(s),\quad f(c;\theta) = c^\top W

for a reward mapping f with learnable W (Belogolovsky et al., 2019). The convex loss

\text{Loss}(W) = \mathbb{E}_{c} \left[ \max_{\pi} f(c;W) \cdot (\mu^\pi_c - \mu^*_c) \right],

is minimized by subgradient methods (mirror descent, PSGD, EW) or evolution strategies. Provable O(1/\sqrt{T}) rates yield zero-shot transfer to unseen contexts, outperforming behavioral cloning and non-convex IRL in generalization error.
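
A schematic projected-subgradient loop for this contextual loss, under the assumption that a planner returning the feature expectations of an optimal policy for a given context and reward weight vector is available; the planner interface, step sizes, and projection radius are illustrative choices, not specifics of the cited work.

```python
import numpy as np

def contextual_irl_psgd(contexts, expert_feats, best_response, eta=0.5, iters=200, seed=0):
    """Projected subgradient descent on Loss(W) = E_c[max_pi f(c;W).(mu^pi_c - mu*_c)].

    contexts      : list of context vectors c, each of shape (k,)
    expert_feats  : list of expert feature expectations mu*_c, each of shape (d,)
    best_response : callable (c, w) -> feature expectations mu^pi_c of a policy
                    optimal for the reward w . phi(s) in the context-c MDP
    """
    rng = np.random.default_rng(seed)
    k, d = len(contexts[0]), len(expert_feats[0])
    W = np.zeros((k, d))
    for t in range(iters):
        i = rng.integers(len(contexts))
        c = contexts[i]
        w = c @ W                                        # reward weights f(c; W) = c^T W
        mu_pi = best_response(c, w)                      # inner maximization over policies
        grad = np.outer(c, mu_pi - expert_feats[i])      # subgradient of the max term w.r.t. W
        W -= eta / np.sqrt(t + 1) * grad
        W /= max(1.0, np.linalg.norm(W))                 # project onto the unit Frobenius ball
    return W
```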

7. Practical Impact and Applications

Convex IRL methods have demonstrated strong empirical performance:

  • Grid-worlds and mountain car: CPIRL and GPIRL recover optimal rewards faster and with less variance than linear-feature IRL (Qiao et al., 2012).
  • Blackjack, Tic-Tac-Toe: convex IRL achieves near-expert loss, even against adversary policies, outperforming many baselines (Tossou et al., 2013).
  • Sepsis treatment: contextual CIRL gives superior action-matching rates relative to BC and standard IRL (Belogolovsky et al., 2019).
  • Marketing: MaxEnt convex IRL yields interpretable consumer preference parameters and tractable calibration (Halperin, 2017).
  • Cooperative settings (CIRL games): exponential reduction in planning via modified Bellman backups enables value alignment and pedagogic human behavior modeling (Malik et al., 2018).
  • Continuous control: regularized adversarial IRL yields analytical feedback laws and global stability certificates (SOS program, Lyapunov constraints) (Tesfazgi et al., 2024, Jeon et al., 2020).

Convexity induces algorithmic reproducibility, strong duality, and tractable generalization to highly structured or noisy demonstration domains.


In summary, Convex Inverse Reinforcement Learning constitutes a robust, mathematically principled approach to reward and preference inference from demonstration. By centering optimization in convex spaces—via QP, MAP estimation with log-concave likelihoods, Gaussian process nonparametrics, regularized occupancy flows, or Bregman divergence minimization—CIRL ensures global recoverability, computational tractability, and extensibility to modern RL settings, including safety constraints, continuous domains, contextual transfer, and cooperative multi-agent interactions (Qiao et al., 2012, Tossou et al., 2013, Tossou et al., 2014, Belogolovsky et al., 2019, Schlaginhaufen et al., 2023, Zhu et al., 27 Jan 2025, Jeon et al., 2020, Tesfazgi et al., 2024, Malik et al., 2018, Halperin, 2017).
