Meta-AIRL: Meta-Inverse Reinforcement Learning
- Meta-AIRL is a meta-learning framework that integrates inverse reinforcement learning to enable efficient reward and policy inference across tasks using limited expert demonstrations.
- It leverages fast inner-loop adaptations and meta-optimization to rapidly infer rewards with few samples, reducing the sample inefficiencies seen in classic IRL.
- Empirical results show Meta-AIRL methods achieve superior performance in domains like navigation, control, and driving, outperforming imitation-only baselines.
Meta-Inverse Reinforcement Learning (Meta-AIRL) refers to a family of algorithms that integrate meta-learning and inverse reinforcement learning (IRL) to enable efficient reward and policy inference in new tasks from limited demonstrations. Meta-AIRL frameworks leverage experience across a wide family of related tasks during meta-training to acquire inductive biases (either as an initialization or as latent embedding spaces) so that, at meta-test time, successful inference and adaptation are possible from only a few demonstrations, or even a single one. This paradigm addresses the sample inefficiency and poor generalization of classic IRL by formalizing the amortization of prior experience into a structured learning procedure.
1. Problem Setup and Theoretical Foundations
Meta-AIRL operates over a family of tasks, each formalized as a Markov Decision Process (MDP) $\mathcal{M}_i = (\mathcal{S}, \mathcal{A}, T, r_i, \gamma)$, with the reward function $r_i$ unknown during policy learning. For meta-training, expert demonstrations $\mathcal{D}_i$ are available for each sampled task $\mathcal{T}_i$; each trajectory $\tau$ is assumed drawn from the maximum-entropy expert distribution:

$$p(\tau \mid r_i) \propto \exp\Big(\sum_{t} r_i(s_t, a_t)\Big)$$
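Under the maximum-entropy model, the unnormalized log-probability of a trajectory is simply the sum of rewards along it; normalizing over a candidate set gives a softmax over returns. A minimal numpy sketch (the tabular reward and trajectories here are hypothetical illustrations, not the papers' setups):

```python
import numpy as np

# Hypothetical tabular setup: 5 states x 2 actions, with a learned reward table.
rng = np.random.default_rng(0)
reward = rng.normal(size=(5, 2))  # r(s, a)

def traj_return(traj, reward):
    """Unnormalized MaxEnt log-probability: sum of rewards along the trajectory."""
    return sum(reward[s, a] for s, a in traj)

def maxent_traj_probs(trajs, reward):
    """Normalize over a finite candidate set of trajectories (softmax over returns)."""
    returns = np.array([traj_return(t, reward) for t in trajs])
    z = np.exp(returns - returns.max())  # subtract max for numerical stability
    return z / z.sum()

trajs = [[(0, 0), (1, 1), (2, 0)], [(0, 1), (3, 0), (4, 1)]]
probs = maxent_traj_probs(trajs, reward)
# probs is a valid distribution over the candidate trajectories.
```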
The meta-learning objective is to train shared parameters—either a network initialization (Xu et al., 2018), a latent context encoder (Yu et al., 2019), or coupled reward/policy parameters (Wang et al., 2021)—to rapidly adapt reward inference (and, in some cases, policy learning) to new tasks with very few demonstrations.
2. Core Methodologies and Algorithms
The principal variants of Meta-AIRL can be categorized as follows:
| Method | Reward Representation | Policy Inference |
|---|---|---|
| MandRIL (Xu et al., 2018) | Network initialization $\theta$ (prior over rewards) | MaxEnt IRL via gradient adaptation |
| Meta-AIRL (PEMIRL) (Yu et al., 2019) | Context-conditional reward $r_\theta(s, a, m)$, latent context $m$ inferred from demo | Conditional AIRL with latent embedding |
| REPTILE/AIRL (Wang et al., 2021) | Discriminator-parameterized reward $r_\theta$ | Joint meta-learned policy $\pi_\phi$, both adapted by gradient steps |
For MandRIL, adaptation consists of inner-loop Maximum-Entropy IRL updates on the reward network using demo data, followed by a meta-gradient step that optimizes the initial parameters $\theta$ to be amenable to quick adaptation:

$$\theta'_{\mathcal{T}} = \theta - \alpha \nabla_\theta \mathcal{L}_{\text{IRL}}(\theta; \mathcal{D}_{\mathcal{T}}), \qquad \min_{\theta} \; \mathbb{E}_{\mathcal{T}} \big[ \mathcal{L}_{\text{IRL}}(\theta'_{\mathcal{T}}; \mathcal{D}_{\mathcal{T}}^{\text{test}}) \big]$$
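The inner/outer mechanics can be illustrated with a toy first-order sketch, where a quadratic per-task loss stands in for the MaxEnt IRL objective (this is a mechanical illustration under that stand-in assumption, not the paper's implementation):

```python
import numpy as np

# Toy stand-in: L_task(theta) = 0.5 * ||theta - target||^2, so grad = theta - target.
def inner_adapt(theta, target, alpha=0.1, steps=3):
    """Inner loop: a few gradient steps of the per-task loss (MaxEnt IRL in MandRIL)."""
    for _ in range(steps):
        theta = theta - alpha * (theta - target)
    return theta

def meta_step(theta0, task_targets, alpha=0.1, beta=0.5):
    """First-order meta-update: move the initialization toward low post-adaptation loss."""
    meta_grad = np.zeros_like(theta0)
    for target in task_targets:
        adapted = inner_adapt(theta0, target, alpha)
        meta_grad += adapted - target  # outer-loss gradient evaluated at adapted params
    return theta0 - beta * meta_grad / len(task_targets)

theta0 = np.zeros(2)
tasks = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # two tasks with distinct optima
for _ in range(100):
    theta0 = meta_step(theta0, tasks)
# The learned initialization sits between the task optima, so either task is
# reachable with a few inner gradient steps.
```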
In probabilistic context meta-IRL (PEMIRL), a Gaussian latent context variable $m$ is inferred via an encoder $q_\psi(m \mid \tau)$. Rewards are expressed as $r_\theta(s, a, m)$, learned by maximizing the expected log-likelihood of trajectories under inferred contexts, regularized by mutual information and approximated through adversarial IRL:

$$\max_{\theta, \psi} \; \mathbb{E}_{\mathcal{T}} \, \mathbb{E}_{m \sim q_\psi(m \mid \tau^{e})} \big[ \log p_\theta(\tau^{e} \mid m) \big] + \lambda \, I(m; \tau)$$
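The amortized-inference pattern can be sketched in a few lines; the linear encoder, feature dimensions, and reward head below are hypothetical placeholders for the paper's networks:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shapes: a demo summarized as a 4-D feature vector; 2-D latent context.
W_enc = rng.normal(size=(2, 4)) * 0.1    # mean head of the encoder q_psi(m | tau)
W_rew = rng.normal(size=(4 + 2,)) * 0.1  # reward weights over [state-action feats, context]

def infer_context(demo_feats):
    """Amortized inference: the mean of the Gaussian posterior q_psi(m | tau)."""
    return W_enc @ demo_feats

def context_reward(sa_feats, m):
    """Context-conditional reward r_theta(s, a, m)."""
    return W_rew @ np.concatenate([sa_feats, m])

demo_feats = rng.normal(size=4)      # features from one expert demonstration
m = infer_context(demo_feats)        # one-shot task inference from a single demo
r = context_reward(rng.normal(size=4), m)
```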
Meta-AIRL with adversarial structure (Wang et al., 2021) employs an inner-outer loop via REPTILE-style averaging over fast-adapted mini-batch updates, jointly for reward and policy networks. The discriminator parameterizes the reward $r_\theta$; the generator policy $\pi_\phi$ is trained via maximum-entropy RL.
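The REPTILE-style outer step reduces to averaging: move the shared parameters a fraction of the way toward the mean of the fast-adapted copies. A minimal sketch (parameter vectors here are hypothetical):

```python
import numpy as np

def reptile_meta_update(theta, adapted_params, epsilon=0.1):
    """REPTILE outer step: nudge shared params toward the average fast-adapted copy."""
    mean_adapted = np.mean(adapted_params, axis=0)
    return theta + epsilon * (mean_adapted - theta)

theta = np.zeros(3)
# Hypothetical fast-adapted parameter copies from three sampled tasks.
adapted = [np.array([1.0, 0.0, 0.0]),
           np.array([0.0, 1.0, 0.0]),
           np.array([0.0, 0.0, 1.0])]
theta = reptile_meta_update(theta, adapted)
# theta has moved a fraction epsilon toward the mean of the adapted copies.
```

In the full method the same averaging is applied to both the discriminator (reward) and generator (policy) parameters after their respective fast adaptations.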
3. Meta-Training and Adaptation Procedures
The general algorithmic structure for Meta-AIRL involves alternation between fast inner-loop adaptation (to simulate test-time adaptation on a meta-train task) and slow meta-optimization (to adjust shared parameters for rapid adaptation):
- Inner Loop (Task Adaptation)
- Given a task $\mathcal{T}$ with demonstrations $\mathcal{D}_{\mathcal{T}}$, perform a small number of IRL/adversarial IRL updates with the (limited) expert demonstrations to yield adapted task-specific parameters (MandRIL: $\theta'_{\mathcal{T}}$; Meta-AIRL: $\theta'_{\mathcal{T}}, \phi'_{\mathcal{T}}$).
- For context-based models (Yu et al., 2019), infer a latent variable $m$ using the encoder $q_\psi$ on the demo trajectory.
- Outer Loop (Meta-Optimization)
- Evaluate the outer loss (a likelihood or adversarial objective on a held-out set of demonstrations) using the adapted parameters.
- Compute (possibly second-order) meta-gradient and update shared parameters toward improved adaptation performance.
The pseudocode for MandRIL (Xu et al., 2018) and REPTILE/AIRL (Wang et al., 2021) formalizes this process. At test time, only a few demonstrations, sometimes just a single trajectory (PEMIRL), are used for adaptation.
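At meta-test time only the inner loop runs, starting from the meta-learned initialization. A hypothetical sketch, again with a placeholder gradient standing in for the IRL update:

```python
import numpy as np

def adapt_at_test(theta_meta, demo_feats, alpha=0.1, steps=5):
    """Run only the inner loop on the new task's few demonstrations."""
    theta = theta_meta.copy()
    for _ in range(steps):
        theta -= alpha * (theta - demo_feats)  # stand-in for the IRL gradient
    return theta

theta_meta = np.array([0.5, 0.5])   # meta-learned initialization (hypothetical)
single_demo = np.array([1.0, 0.0])  # feature summary of one demonstration
theta_task = adapt_at_test(theta_meta, single_demo)
# A few steps move the parameters much closer to the new task's optimum.
```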
4. Model Architectures and Implementation Details
Architectural choices are largely domain-dependent. In high-dimensional domains such as SpriteWorld, the reward network is a deep convolutional model mapping RGB images to spatial reward maps (Xu et al., 2018). SUNCG experiments use a small conv-MLP that receives panoramic semantic-segmentation images as input. In driving and control domains (Wang et al., 2021, Yu et al., 2019), policies and reward networks are MLPs with 2–3 fully connected layers, typically with tanh or ReLU activations.
Inner-loop adaptation often uses vanilla SGD (MandRIL, PEMIRL), and meta-updates employ Adam (MandRIL) or bespoke schedule parameters (Meta-AIRL). Meta-AIRL on driving tasks uses joint policy/reward learning, leveraging TRPO or PPO for policy optimization and cross-entropy for discriminator updates (Wang et al., 2021).
In context-based IRL, the encoder is a two-layer MLP mapping demonstrations to the latent variable; mutual information regularization is necessary to preserve task-relevant variation (Yu et al., 2019).
5. Empirical Results and Benchmark Performance
Meta-AIRL methods demonstrate sample efficiency and generalization across both synthetic and realistic domains.
- MandRIL (SpriteWorld/SUNCG):
- On SpriteWorld (navigation from raw pixels), MandRIL achieves low expected value difference (VD) with as few as 3–5 demonstrations, surpassing MaxEnt IRL and black-box meta-learner baselines which require at least 10 demos. This holds even for tasks involving unseen colors and shapes (Xu et al., 2018).
- In SUNCG (first-person navigation/manipulation), MandRIL attains meta-test success rates of 77.3% (test) and 82.6% (unseen-houses) with only 5 demos, outperforming scratch IRL and pre-training baselines.
- Meta-AIRL/PEMIRL (Control domains):
- On continuous control (Point-Maze, Ant, Sawyer-Pusher), PEMIRL enables one-shot reward inference and policy adaptation, matching or exceeding Meta-Imitation and InfoGAIL. In novel dynamics (e.g. Ant with disabled legs), immediate adaptation outperforms imitation-only learners, demonstrating robustness to environmental changes (Yu et al., 2019).
- Meta-AIRL with Adversarial IRL (Driving):
- In lane-change scenarios, Meta-AIRL reaches expert-level reward and near 100% success rate with ≈10 demo trajectories for unseen “aggressive” styles. Baselines require 3–5× more data. Kinematic metric histograms closely match the target expert distributions, with significantly lower L₁ divergence than pretrain+finetune or scratch AIRL (Wang et al., 2021).
6. Conceptual Insights and Inductive Biases
Meta-AIRL enforces a “prior” over reward functions (and, in some cases, policies) by optimizing for rapid IRL adaptation from limited data. In MandRIL, learning such that several gradient steps of IRL suffice to explain new demonstrations places the network initialization in a “flat valley” of the IRL loss landscape for plausible reward functions. This implements a Gaussian-like prior and prevents overfitting to spurious rewards induced by underspecified, few-shot data (Xu et al., 2018).
Latent context methods (PEMIRL) induce a context space that clusters task intent and supports amortized inference; mutual information regularization ensures that context variables are informative for downstream adaptation (Yu et al., 2019). Recovery of reward structure, and not mere behavior cloning, ensures robustness to changes in environment dynamics at meta-test time—a critical advantage over imitation-only baselines.
Meta-AIRL frameworks thus provide a principled solution to few-shot reward inference and policy generalization in IRL, crucial for scalable learning from demonstration in the presence of distributional shift and sparse data.
7. Comparative Summary and Significance
Meta-AIRL defines a meta-learning-enhanced class of IRL methods capable of fast and robust imitation from very limited expert supervision. The foundational attribute distinguishing these approaches from classic IRL is the explicit use of prior multi-task experience, whether realized as a learned initialization for gradient-based adaptation (MandRIL), a latent embedding space (PEMIRL), or coupled policy and reward parameterization with adversarial training (Meta-AIRL by Wang et al., 2021), optimized to support few-shot inference. Experimental results across vision-based navigation, continuous control, and complex decision-making indicate that meta-learned priors and contextualization lead to rapid adaptation, improved sample efficiency, and more robust reward inference compared to pre-training and imitation-only strategies (Xu et al., 2018, Wang et al., 2021, Yu et al., 2019).