Planning-Based Inverse Reinforcement Learning

Updated 4 July 2026

Planning-based Inverse Reinforcement Learning is a framework that infers latent rewards from expert trajectories using forward planning in Markov decision processes.
It integrates Bellman optimality, maximum entropy, and regularization techniques to model expert behavior and account for non-optimal actions.
Recent advances mitigate computational bottlenecks through successor-feature pretraining, Q-space reparameterization, and expert-state resets, enhancing scalability.

Planning-based inverse reinforcement learning (IRL) denotes a family of methods in which behavior is modeled as the result of a forward planning computation, and reward inference proceeds by inverting that reward-to-policy map from demonstrations. In the canonical formulation, the expert is assumed to act optimally or approximately optimally in a Markov decision process (MDP), so the unknown object is a reward function, value function, or closely related latent objective whose induced policy explains the observed trajectories or actions (Suzuki, 2017, Bajgar et al., 2024, Abdulhai et al., 2022).

1. Formal foundations

A standard finite MDP in this literature is written as

$\mathcal{M} = (S, A, P_{sa}, \gamma, R),$

or equivalently

$\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, \mathcal{R}, \gamma \rangle,$

with state space, action space, transition law, discount factor, and reward function. Planning-based IRL takes as given the environment dynamics and an observed policy or demonstration set, and seeks a reward $R$ such that the observed policy is optimal for the corresponding MDP. In the policy-based statement, the input is $(S,A,P_{sa},\gamma,\pi)$ and the output is a reward $R:S\times A\to \mathbb{R}$ satisfying $\pi=\pi^*$; in the trajectory-based statement, the data are demonstrations $\mathcal{D}=\{\tau^{(1)},\dots,\tau^{(M)}\}$, and the reward is chosen so that forward planning under that reward reproduces the observed behavior (Suzuki, 2017).

The forward component is governed by Bellman optimality. For a candidate reward, the optimal value and policy satisfy

$V^*(s)=\max_{a\in A}\left[R(s,a)+\gamma \sum_{s'}P(s'|s,a)V^*(s')\right],$

where the optimal action-value function is

$Q^*(s,a)=R(s,a)+\gamma \sum_{s'}P(s'|s,a)V^*(s'),$

and

$\pi^*(s)=\arg\max_{a\in A} Q^*(s,a).$

Planning-based IRL therefore couples inverse learning to repeated solution, approximation, or reformulation of this forward problem. In its most direct form, the inner loop is: assume a reward $R_\theta$, solve the MDP, compare the induced policy or trajectories to demonstrations, and update $\theta$ until convergence. This forward–inverse duality is the defining feature of the paradigm (Suzuki, 2017).
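The forward step of that loop can be sketched as tabular value iteration. This is a minimal illustration, not code from any of the cited papers: the names `value_iteration`, `P`, and `R`, and the two-state example, are assumptions introduced here.

```python
from typing import List, Tuple

Transitions = List[List[List[float]]]  # P[s][a][s_next]
Rewards = List[List[float]]            # R[s][a]


def value_iteration(P: Transitions, R: Rewards, gamma: float,
                    tol: float = 1e-10, max_iter: int = 10_000
                    ) -> Tuple[List[float], List[List[float]], List[int]]:
    """Approximate V*, Q*, and a greedy policy by repeated Bellman backups."""
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_iter):
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * E[V(s')]
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_actions)] for s in range(n_states)]
        V_new = [max(row) for row in Q]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            break
    # Greedy policy: pi*(s) = argmax_a Q*(s,a)
    policy = [max(range(n_actions), key=lambda a, s=s: Q[s][a])
              for s in range(n_states)]
    return V, Q, policy


# Hypothetical two-state MDP: action 0 stays, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 0.0],   # no reward in state 0
     [1.0, 0.0]]   # staying in state 1 pays 1
V, Q, policy = value_iteration(P, R, gamma=0.9)
print(policy, [round(v, 3) for v in V])
```

In an IRL loop, a routine like this would be called once per candidate reward, which is the cost that the amortization methods in Section 3 try to avoid.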

Many modern formulations replace hard optimality by Boltzmann-rational or maximum-entropy decision models. In Bayesian IRL and MaxEnt IRL, the likelihood of an action under a candidate value function is typically written as

$P(a\mid s, Q)\propto \exp\big(\beta\, Q(s,a)\big),$

or, in policy form,

$\pi(a\mid s)=\frac{\exp\big(\beta\, Q(s,a)\big)}{\sum_{a'\in A}\exp\big(\beta\, Q(s,a')\big)},$

where $\beta\ge 0$ is an inverse-temperature (rationality) parameter. This retains the planning semantics, since actions remain value-driven, but softens the exact optimality assumption and makes probabilistic inference possible (Bajgar et al., 2024, Abdulhai et al., 2022).
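A Boltzmann-rational likelihood of this kind can be sketched in a few lines. The function names, the inverse temperature `beta`, and the example Q-values are illustrative assumptions, not notation taken from the cited papers.

```python
import math
from typing import List, Sequence, Tuple


def boltzmann_policy(q_row: Sequence[float], beta: float) -> List[float]:
    """pi(a|s) proportional to exp(beta * Q(s,a)); assumes beta >= 0."""
    m = max(q_row)  # subtract the max for numerical stability
    weights = [math.exp(beta * (q - m)) for q in q_row]
    z = sum(weights)
    return [w / z for w in weights]


def demo_log_likelihood(demos: Sequence[Tuple[int, int]],
                        Q: Sequence[Sequence[float]], beta: float) -> float:
    """Sum of log pi(a|s) over demonstrated (state, action) pairs."""
    return sum(math.log(boltzmann_policy(Q[s], beta)[a]) for s, a in demos)


# Hypothetical Q-values and demonstrations that pick the higher-valued action.
Q = [[8.1, 9.0],
     [10.0, 8.1]]
demos = [(0, 1), (1, 0)]
print(demo_log_likelihood(demos, Q, beta=0.0))  # uniform policy
print(demo_log_likelihood(demos, Q, beta=5.0))  # near-greedy policy
```

With `beta = 0` the model ignores values entirely, and as `beta` grows it approaches the hard-optimal policy, so the parameter controls how strictly the optimality assumption is enforced.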

A persistent conceptual limitation is that human or expert behavior may be ignorant, inconsistent, or otherwise non-optimal. Several papers therefore treat exact optimality as a modeling convenience rather than a literal behavioral law. This suggests that planning-based IRL is best understood as a family of inverse decision models whose success depends on how accurately the embedded planner captures the demonstrator’s effective decision process (Suzuki, 2017).

2. Canonical formulations and reward representations

Early and classical planning-based IRL methods encode the expert’s optimality as explicit constraints. In finite-state settings, Ng and Russell’s inequalities can be written as

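$(\mathbf{P}_{\pi}-\mathbf{P}_{a})(\mathbf{I}-\gamma \mathbf{P}_{\pi})^{-1}\mathbf{R}\succeq 0 \quad \text{for all } a\in A,$

where $\mathbf{P}_{\pi}$ is the state-transition matrix under the expert policy, $\mathbf{P}_{a}$ is the transition matrix under action $a$, and $\mathbf{R}$ is the vector of state rewards. These inequalities lead to convex quadratic or linear programs over rewards under a known policy. A Bayesian convex formulation places a Gaussian prior on the reward vector and then performs maximum a posteriori estimation under these Bellman-derived constraints, so the inverse problem is solved by constrained optimization rather than by direct policy fitting (Qiao et al., 2012).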
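The optimality constraints can be checked numerically for a candidate reward. The sketch below assumes state-only rewards and uses the equivalent row-wise statement that, in every state, the expert's action leads to at least as much expected value under the expert policy as any alternative. All names and the small example are hypothetical.

```python
from typing import List

Transitions = List[List[List[float]]]  # P[s][a][s_next]


def policy_value(P: Transitions, R: List[float], policy: List[int],
                 gamma: float, tol: float = 1e-12,
                 max_iter: int = 100_000) -> List[float]:
    """Iteratively solve V = R + gamma * P_pi V for a fixed policy."""
    V = [0.0] * len(R)
    for _ in range(max_iter):
        V_new = [R[s] + gamma * sum(p * v for p, v in zip(P[s][policy[s]], V))
                 for s in range(len(R))]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            break
    return V


def satisfies_optimality_constraints(P: Transitions, R: List[float],
                                     policy: List[int], gamma: float,
                                     slack: float = 1e-9) -> bool:
    """Check (P_pi(s) - P_a(s)) . V_pi >= 0 for every state s and action a."""
    V = policy_value(P, R, policy, gamma)
    for s in range(len(R)):
        under_pi = sum(p * v for p, v in zip(P[s][policy[s]], V))
        for a in range(len(P[s])):
            under_a = sum(p * v for p, v in zip(P[s][a], V))
            if under_pi - under_a < -slack:
                return False
    return True


# Hypothetical two-state MDP: action 0 stays, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
expert = [1, 0]  # move to state 1, then stay there
print(satisfies_optimality_constraints(P, [0.0, 1.0], expert, 0.9))
print(satisfies_optimality_constraints(P, [1.0, 0.0], expert, 0.9))
print(satisfies_optimality_constraints(P, [0.0, 0.0], expert, 0.9))
```

The all-zero reward passes the check trivially, which is one reason constraint-based formulations add priors or margin terms on top of the raw inequalities.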

A second canonical line models rewards nonparametrically. “Inverse Reinforcement Learning with Gaussian Process” represents action-conditioned rewards as Gaussian processes and uses preference graphs built from demonstrations to express strict and equivalent action preferences in terms of Q-value comparisons. This retains the planning layer, because the likelihood of a preference edge depends on the Q-values induced by the current reward through the MDP dynamics, while replacing hand-specified linear rewards with a GP prior over functions (Qiao et al., 2012). “Inverse Reinforcement Learning via Deep Gaussian Process” extends the same idea by stacking latent GP layers and using the Maximum Entropy IRL likelihood as the planning engine, so representation learning and reward inference are optimized jointly while soft value iteration remains inside the loop (Jin et al., 2015).

Regularization supplies a third canonical reformulation. In “Regularized Inverse Reinforcement Learning,” the planner maximizes

$J_\Omega(R,\pi)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\Big(R(s_t,a_t)-\Omega\big(\pi(\cdot\mid s_t)\big)\Big)\right],$

where $\Omega$ is a strongly convex policy regularizer. Strong convexity gives a unique optimal policy for every reward and prevents the expert from being rationalized by arbitrary constant rewards. The paper derives a target reward from the gradient of the policy regularizer and shows that planning with that reward is equivalent to minimizing a discounted policy-level Bregman divergence to the expert policy. Maximum-entropy IRL appears as the special case $\Omega(\pi(\cdot\mid s))=\sum_{a}\pi(a\mid s)\log\pi(a\mid s)$ (negative Shannon entropy), for which the induced divergence is the Kullback–Leibler divergence and the target reward reduces to $\log\pi_E(a\mid s)$, where $\pi_E$ is the expert policy (Jeon et al., 2020).
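The maximum-entropy special case can be checked numerically. The sketch below assumes the regularizer is the negative Shannon entropy and the target reward is the log of the expert policy, and verifies that entropy-regularized (soft) planning with that reward returns the expert policy. The planner, the variable names, and the small MDP are illustrative assumptions, not the paper's implementation.

```python
import math
from typing import List

Transitions = List[List[List[float]]]  # P[s][a][s_next]


def logsumexp(xs: List[float]) -> float:
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_planner(P: Transitions, R: List[List[float]], gamma: float,
                 tol: float = 1e-12, max_iter: int = 100_000
                 ) -> List[List[float]]:
    """Soft value iteration; returns pi(a|s) = exp(Q(s,a) - V(s))."""
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    Q = [row[:] for row in R]
    for _ in range(max_iter):
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_actions)] for s in range(n_states)]
        V_new = [logsumexp(row) for row in Q]  # soft maximum over actions
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            break
    return [[math.exp(q - v) for q in row] for row, v in zip(Q, V)]


# Hypothetical expert policy and dynamics.
pi_expert = [[0.7, 0.3],
             [0.2, 0.8]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]

# Assumed target reward in the MaxEnt case: log of the expert policy.
target_reward = [[math.log(p) for p in row] for row in pi_expert]
recovered = soft_planner(P, target_reward, gamma=0.9)
print([[round(p, 6) for p in row] for row in recovered])
```

The check works because the soft value function is identically zero under this reward, so the planner's policy reduces to the exponentiated reward, which is the expert policy itself.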

Across these formulations, the central modeling choice is not only the reward class but the way planning is embedded: as exact Bellman inequalities, as a soft value function inside a probabilistic likelihood, or as a regularized control problem whose optimum is unique by construction.

3. Computational bottlenecks and planning amortization

The main computational cost of planning-based IRL is the need to solve a forward control problem for many candidate rewards. This cost is especially severe in Bayesian IRL, adversarial IRL, and maximum-entropy methods, where inference may require thousands of reward updates, each coupled to dynamic programming or reinforcement learning. Several recent methods attack this bottleneck directly: BASIS shifts most of the planning into offline successor-feature pretraining, ValueWalk samples in Q-space rather than reward space, and “Inverse Reinforcement Learning without Reinforcement Learning” restricts planning to expert state distributions instead of the full exploration problem (Abdulhai et al., 2022, Bajgar et al., 2024, Swamy et al., 2023).

Method	Core reduction of planning cost	Representative result
BASIS	Multi-task RL with successor features amortizes planning into pretraining	Accurately infers reward functions from fewer than 100 trajectories
ValueWalk	MCMC in Q-space with Bellman inversion replaces repeated forward RL solves	Time per effective sample reported for three gridworld sizes and compared with PolicyWalk
IRL without RL	Expert-state resets convert inner RL into local moment matching / cost-sensitive classification	Replaces worst-case exponential interaction cost by polynomial dependence

BASIS assumes linear rewards in learned cumulants,

$R(s,a)=\phi(s,a)^{\top}\mathbf{w},$

and factorizes values as

$Q(s,a)=\psi(s,a)^{\top}\mathbf{w}.$

The heavy planning stage occurs once during multi-task RL pretraining, which learns shared cumulants $\phi$, successor features $\psi$, and task vectors $\mathbf{w}_i$. IRL on a new task then reduces mainly to updating a new preference vector $\mathbf{w}$ under a MaxEnt-style behavioral cloning loss plus an inverse temporal-difference consistency term, rather than re-solving a fresh MDP for every reward hypothesis. Empirically, BASIS “typically achieves a given EVD or reward MSE using less than 1/3 the number of demonstrations required by baselines,” reaches the best value difference in Fruit-Picking with “fewer than 1000 demonstrations,” and in Highway and Roundabout can converge after “1–100” demonstrations (Abdulhai et al., 2022).
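The amortization argument rests on a linear identity: if rewards are linear in cumulant features, the action values of a fixed policy are linear in the corresponding successor features with the same weights. The sketch below illustrates only that identity on a tabular example. It is not the BASIS algorithm, and every name and number in it is an assumption made for illustration.

```python
from typing import List

Transitions = List[List[List[float]]]  # P[s][a][s_next]
Features = List[List[List[float]]]     # phi[s][a][k]


def successor_features(P: Transitions, phi: Features, policy: List[int],
                       gamma: float, tol: float = 1e-12,
                       max_iter: int = 100_000) -> Features:
    """psi(s,a) = phi(s,a) + gamma * E[psi(s', policy(s'))], by iteration."""
    n_s, n_a, d = len(phi), len(phi[0]), len(phi[0][0])
    psi = [[[0.0] * d for _ in range(n_a)] for _ in range(n_s)]
    for _ in range(max_iter):
        new = [[[phi[s][a][k] + gamma * sum(P[s][a][t] * psi[t][policy[t]][k]
                                            for t in range(n_s))
                 for k in range(d)] for a in range(n_a)] for s in range(n_s)]
        delta = max(abs(new[s][a][k] - psi[s][a][k])
                    for s in range(n_s) for a in range(n_a) for k in range(d))
        psi = new
        if delta < tol:
            break
    return psi


def linear_in_features(feats: Features, w: List[float]) -> List[List[float]]:
    """Dot every feature vector with w (rewards from phi, values from psi)."""
    return [[sum(f * x for f, x in zip(vec, w)) for vec in row] for row in feats]


def evaluate_q(P: Transitions, R: List[List[float]], policy: List[int],
               gamma: float, tol: float = 1e-12,
               max_iter: int = 100_000) -> List[List[float]]:
    """Direct policy evaluation, used here only as a reference."""
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(max_iter):
        new = [[R[s][a] + gamma * sum(P[s][a][t] * Q[t][policy[t]]
                                      for t in range(n_s))
                for a in range(n_a)] for s in range(n_s)]
        delta = max(abs(new[s][a] - Q[s][a])
                    for s in range(n_s) for a in range(n_a))
        Q = new
        if delta < tol:
            break
    return Q


# Hypothetical dynamics, cumulants, and fixed policy.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
phi = [[[1.0, 0.0], [0.0, 1.0]],
       [[0.5, 0.5], [1.0, 1.0]]]
policy = [1, 0]
psi = successor_features(P, phi, policy, gamma=0.9)  # computed once

# Evaluating a new reward hypothesis is then a dot product, not a new solve.
w_new = [0.5, -2.0]
q_fast = linear_in_features(psi, w_new)
q_reference = evaluate_q(P, linear_in_features(phi, w_new), policy, gamma=0.9)
print(q_fast)
print(q_reference)
```

The expensive iteration runs once per policy, after which each candidate weight vector costs a single pass of dot products.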

ValueWalk makes a different reparameterization. Standard Bayesian IRL samples rewards $R$ and repeatedly solves for $Q^*$. ValueWalk instead samples $Q$ directly and reconstructs the corresponding reward by Bellman inversion,

$R(s,a)=Q(s,a)-\gamma\sum_{s'}P(s'|s,a)\max_{a'\in A}Q(s',a'),$

or by the analogous one-step expectation in continuous settings. This turns forward planning from an iterative dynamic-programming routine into a single linear or expectation computation. In gridworld experiments, posterior samples from ValueWalk match those of reward-space samplers, and the time per effective sample is reported for three gridworld sizes alongside the corresponding times for PolicyWalk (Bajgar et al., 2024).
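The direction of the computation can be illustrated on a tabular example. The sketch assumes the inversion takes the hard-max form R(s,a) = Q(s,a) - gamma * E[max_a' Q(s',a')], then confirms by ordinary value iteration that the recovered reward has the sampled Q as its optimal action-value function. The function names and numbers are hypothetical and the sampler itself is not shown.

```python
from typing import List

Transitions = List[List[List[float]]]  # P[s][a][s_next]
Table = List[List[float]]              # indexed [s][a]


def bellman_inversion(P: Transitions, Q: Table, gamma: float) -> Table:
    """One pass, no iteration: R(s,a) = Q(s,a) - gamma * E[max_a' Q(s',a')]."""
    V = [max(row) for row in Q]
    return [[Q[s][a] - gamma * sum(p * v for p, v in zip(P[s][a], V))
             for a in range(len(Q[s]))] for s in range(len(Q))]


def optimal_q(P: Transitions, R: Table, gamma: float,
              tol: float = 1e-12, max_iter: int = 100_000) -> Table:
    """Reference forward solve by value iteration (the step being avoided)."""
    V = [0.0] * len(R)
    Q = [row[:] for row in R]
    for _ in range(max_iter):
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(len(R[s]))] for s in range(len(R))]
        V_new = [max(row) for row in Q]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            break
    return Q


# Hypothetical dynamics and a hypothetical Q-function "sample".
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
Q_sample = [[1.0, -0.5],
            [2.0, 0.3]]

R_implied = bellman_inversion(P, Q_sample, gamma=0.9)
Q_check = optimal_q(P, R_implied, gamma=0.9)
print(R_implied)
print(Q_check)
```

Going from Q to R takes one expectation per state-action pair, whereas going from R to Q needs the iterative solve in `optimal_q`, which is the asymmetry a Q-space sampler can exploit.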

“Inverse Reinforcement Learning without Reinforcement Learning” pushes the same agenda from a different angle. It shows that traditional primal and dual IRL reductions can require exponentially many environment interactions in the worst case because each reward update embeds a full RL problem. Its MMDP and NRMM algorithms assume reset access to expert state distributions, so the inner problem becomes dynamic programming or greedy improvement only on expert states. This yields polynomial per-iteration interaction complexity and an “exponential speedup in theory” relative to the naive reduction (Swamy et al., 2023).

Taken together, these methods do not remove planning from IRL. They relocate it: into offline successor-feature learning, into a value-space parameterization, or into expert-state-local subproblems.

4. Continuous state spaces, physics-based reformulations, and control-theoretic variants

Extending planning-based IRL beyond tabular MDPs has produced several structurally different approaches. One route is to retain Bellman-style inequalities but move to function-space representations. “Inverse Reinforcement Learning in a Continuous State Space with Formal Guarantees” models unknown transition operators and rewards through expansions in an orthonormal basis and represents each action’s dynamics by an infinite coefficient matrix. The planning object is an advantage operator, which converts Bellman optimality into linear inequalities in the reward coefficients. The resulting algorithm estimates transition matrices from samples, computes approximate advantage operators, and solves a linear program over the reward coefficients. It provides correctness and sample-complexity guarantees; in the main theorem, the sample size scales with problem parameters that include the separability margin and the transition-operator norm bounds (Dexter et al., 2021).

A second route reframes planning through stochastic physics. FP-IRL assumes that trajectories in the lumped state–action space follow an Itô stochastic differential equation, with density evolution governed by the Fokker–Planck equation. The method conjectures an isomorphism between the potential function of the Fokker–Planck dynamics and the action-value function of the MDP, so the inferred potential becomes a negative action-value function. From the inferred potential, FP-IRL constructs the transition model, obtains a Boltzmann policy, and finally recovers the reward via the inverse Bellman equation.

This preserves planning semantics while replacing repeated RL solves by a physics-constrained system-identification step (Huang et al., 2023).

A third route, “Inverse Reinforcement Learning: A Control Lyapunov Approach,” replaces explicit reward learning by direct inference of a Control Lyapunov Function (CLF). The observed goal-directed expert is modeled as stabilizing a nonlinear control-affine system, and the inverse problem becomes finding a continuous function that satisfies the Lyapunov conditions along the demonstrated motion, with equality at the target state. By inverse optimality, every such Lyapunov function is also a meaningful value function for some optimal control problem, so the learned CLF plays the role of the latent planning objective. The paper then uses a closed-form stabilizing policy instead of repeatedly solving an optimal control problem during learning (Tesfazgi et al., 2021).

These variants broaden the meaning of planning-based IRL. The planner need not always be value iteration; it may be a basis-operator solve, a Bellman inversion in value space, a Fokker–Planck potential, or a Lyapunov-stabilizing controller.

5. Structured intent: subgoals, symbolic tasks, and multiple planning horizons

Planning-based IRL also serves as a framework for latent structure beyond a single global reward. “Inverse Reinforcement Learning via Nonparametric Spatio-Temporal Subgoal Modeling” assumes that even a single trajectory can be better explained by local subgoals than by one global intention. Each subgoal induces a reward spike and a corresponding planned policy based on the action values of the resulting MDP. The paper then uses distance-dependent Chinese Restaurant Processes to assign states or time indices to subgoal clusters, allowing intentions to vary over space or over time. Its normalized action likelihood is built from rescaled action values, yielding posterior predictive policies that are smooth over state or time and remain consistent with the expert’s local plan (Šošić et al., 2018).

WFA-IRL introduces symbolic task structure into the planning loop. It first learns a weighted finite automaton from demonstrations using spectral learning, where the automaton maps words over atomic propositions to scalar task scores. It then constructs a product between the automaton and a labeled MDP, so planning becomes a deterministic shortest-path problem in product space. The inverse stage learns a low-level cost by differentiating the negative log-likelihood of expert actions through the planner via subgradients over optimal paths. The high-level logical specification and low-level motion costs are therefore inferred separately but coupled through explicit planning (Wang et al., 2021).

A related source of latent structure is heterogeneity in planning horizon. “Inverse Reinforcement Learning with Multiple Planning Horizons” studies the case where several experts optimize a shared reward but each expert $k$ uses a different unknown discount factor $\gamma_k$. The paper shows that freeing the discount factors enlarges the feasible solution space of rewards and can make naive extensions of standard IRL collapse several experts onto the same horizon. Its MPLP-IRL algorithm augments LP-based IRL with expert-specific distinguishability sets, while its MPMCE-IRL counterpart extends maximum causal entropy IRL. In both cases, discount factors are optimized in an outer loop by Bayesian optimization, and the reward is optimized in an inner loop under the current discount factors (Yao et al., 2024).
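A small example shows why the planning horizon matters for reward inference: under one shared reward, a short-horizon and a long-horizon planner can choose different actions in the same state. The chain MDP and all names below are hypothetical, and the sketch covers only the forward effect, not MPLP-IRL or MPMCE-IRL.

```python
from typing import List


def one_hot(i: int, n: int) -> List[float]:
    return [1.0 if j == i else 0.0 for j in range(n)]


def greedy_policy(P: List[List[List[float]]], R: List[List[float]],
                  gamma: float, tol: float = 1e-12,
                  max_iter: int = 100_000) -> List[int]:
    """Value iteration followed by a greedy readout of the policy."""
    n_s, n_a = len(R), len(R[0])
    V = [0.0] * n_s
    Q = [row[:] for row in R]
    for _ in range(max_iter):
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_a)] for s in range(n_s)]
        V_new = [max(row) for row in Q]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            break
    return [max(range(n_a), key=lambda a, s=s: Q[s][a]) for s in range(n_s)]


# Chain with states 0, 1, 2 and an absorbing state 3.
# Action 0 ("exit") jumps to state 3; action 1 ("advance") moves one step on.
N = 4
P = [[one_hot(3, N), one_hot(1, N)],
     [one_hot(3, N), one_hot(2, N)],
     [one_hot(3, N), one_hot(3, N)],
     [one_hot(3, N), one_hot(3, N)]]
# Shared reward: exiting pays 1 at state 0 and 5 at state 2; advancing pays 0.
R = [[1.0, 0.0],
     [0.0, 0.0],
     [5.0, 0.0],
     [0.0, 0.0]]

myopic = greedy_policy(P, R, gamma=0.3)
farsighted = greedy_policy(P, R, gamma=0.9)
print(myopic[0], farsighted[0])  # action chosen in the start state
```

If both experts were forced to share a single discount factor, no choice of that factor would explain both start-state actions under this reward, which is the kind of mismatch that motivates treating horizons as unknowns.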

These methods illustrate a general point: planning-based IRL is often less about recovering a scalar reward alone than about recovering the hidden structure of the planner—subgoals, symbolic logic, or horizon parameters—that makes the demonstrations coherent.

6. Applications, empirical behavior, and open problems

Application work has used planning-based IRL to build behaviorally grounded simulators and planners in domains where rules are difficult to specify manually. In participatory urban simulation, IRL is proposed as a “new paradigm” for generating agents in pedestrian movement and evacuation models. The intended workflow is to collect trajectories from GPS, Wi‑Fi, RFID, or participatory experiments, infer latent rewards from those demonstrations, and then deploy large numbers of agents that plan in an MDP under the learned reward. The paper emphasizes Kyoto Station evacuation and street-network pedestrian models as motivating examples, while also stressing that quantitative validation remains difficult and often falls back on qualitative judgments such as “the levels of reality participants experienced” (Suzuki, 2017).

In port operations, “Temporal-IRL” casts berth scheduling at Maher Terminal as a planning problem over 8-hour windows with a 19-slot state vector covering berths, waiting area, and incoming vessels. A temporal feature extractor based on an LSTM autoencoder feeds a Maximum Entropy IRL model that predicts berth assignments, waiting times, and departures. Using data from January 2015 to September 2023, the paper reports action accuracy, congestion accuracy, and leave accuracy for the learned model. These results position planning-based IRL as a forecasting method for congestion through learned operational priorities rather than through manually specified queueing rules (Li et al., 24 Jun 2025).

In autonomous driving, planning-based IRL has been integrated with conditional motion prediction. “Conditional Predictive Behavior Planning with Inverse Reinforcement Learning for Human-like Autonomous Driving” generates a set of kinematically feasible ego trajectory proposals over a fixed planning horizon, predicts other agents’ futures conditioned on each proposal, and then scores the proposals with a maximum-entropy IRL cost model. On the reported benchmark, the planning module is evaluated by minFDE and top-3 plan accuracy against a neural classifier baseline, while conditional prediction improves both prediction and planning relative to a non-conditional predictor (Huang et al., 2022).
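Scoring a finite set of proposals with a maximum-entropy cost model can be sketched as a softmax over negative costs. The sketch assumes a cost that is linear in hand-picked proposal features, which is a simplification introduced here and not necessarily the cited paper's cost model. The feature values, weights, and function names are hypothetical.

```python
import math
from typing import List, Sequence


def proposal_probabilities(features: Sequence[Sequence[float]],
                           weights: Sequence[float]) -> List[float]:
    """p(i) proportional to exp(-cost_i), with cost_i = weights . features_i."""
    costs = [sum(w * f for w, f in zip(weights, feat)) for feat in features]
    m = min(costs)  # shift so the largest exponent is zero
    unnorm = [math.exp(-(c - m)) for c in costs]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def negative_log_likelihood(features: Sequence[Sequence[float]],
                            weights: Sequence[float], expert: int) -> float:
    return -math.log(proposal_probabilities(features, weights)[expert])


def nll_gradient(features: Sequence[Sequence[float]],
                 weights: Sequence[float], expert: int) -> List[float]:
    """Gradient of the NLL: expert features minus expected features."""
    probs = proposal_probabilities(features, weights)
    dim = len(weights)
    expected = [sum(p * feat[k] for p, feat in zip(probs, features))
                for k in range(dim)]
    return [features[expert][k] - expected[k] for k in range(dim)]


# Hypothetical proposals described by (jerk, distance-to-goal) features.
features = [[1.0, 0.2],
            [0.5, 0.9],
            [2.0, 0.1]]
weights = [1.0, 1.0]
expert_choice = 1  # index of the proposal closest to the human trajectory

print(proposal_probabilities(features, weights))
grad = nll_gradient(features, weights, expert_choice)
updated = [w - 0.1 * g for w, g in zip(weights, grad)]  # one gradient step
print(negative_log_likelihood(features, weights, expert_choice),
      negative_log_likelihood(features, updated, expert_choice))
```

Because the candidate set is finite, the normalizer is an explicit sum, so no inner planning loop is needed to evaluate the likelihood or its gradient.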

Across the literature, several limitations recur. First, optimality assumptions remain fragile. Urban simulation work explicitly notes that people are often ignorant or inconsistent, and that IRL may misread systematic error as true preference (Suzuki, 2017). Second, computational complexity is still central: state-action spaces can grow exponentially in the number of agents, and even improved methods such as ValueWalk, BASIS, or expert-reset reductions are specialized responses rather than universal solutions (Bajgar et al., 2024, Abdulhai et al., 2022, Swamy et al., 2023). Third, identifiability remains partial. Regularized IRL removes the arbitrary-constant degeneracy but still admits potential-based shaping invariances, and multiple-planning-horizon IRL shows that unknown discount factors further enlarge the feasible reward set (Jeon et al., 2020, Yao et al., 2024). Fourth, model misspecification and partial observability remain common sources of error, whether the issue is sparse mobile-sensor data in urban planning, empirical rather than parametric transitions in port scheduling, or restricted task logic in automaton-based models (Suzuki, 2017, Li et al., 24 Jun 2025, Wang et al., 2021).

A plausible implication is that the future of planning-based IRL lies not in a single universal algorithm but in problem-specific combinations of planning structure, representation learning, and uncertainty modeling. The published record already contains several such combinations: successor-feature amortization, Q-space Bayesian inference, orthonormal-basis Bellman operators, Fokker–Planck inversion, CLF learning, symbolic automata, and heterogeneous-horizon inference. What unifies them is the same core commitment: reward inference is meaningful only through an explicit model of how rewards generate plans.