Multi-Task Inverse Reward Design (MIRD)

Updated 21 December 2025
  • MIRD is a framework that infers transferable reward functions by decomposing them into task-specific and common-sense components.
  • It employs methods like adversarial IRL, variational techniques, and latent context inference to robustly recover rewards across diverse tasks.
  • Empirical results in robotics and process control show improved transferability, efficiency, and safety compared to single-task approaches.

Multi-Task Inverse Reward Design (MIRD) encompasses a class of algorithms and frameworks for inferring robust, transferable reward functions from multiple task settings or reward sources, primarily within the inverse reinforcement learning (IRL) paradigm. The central motivation is to overcome reward function misspecification, reward hacking, and behavioral non-identifiability by leveraging multi-task structure, whether in explicit tasks or in implicit modal variations, to recover reward components (such as common-sense or context-conditioned terms) that generalize across environments. Theoretical and empirical advances provide identifiability guarantees and demonstrate practical effectiveness across applications including robotics, process control, and safe AI design.

1. Problem Formulation and Core Principle

MIRD addresses the challenge of designing or inferring reward functions in the context of multiple related tasks or from conflicting or partial reward specifications. Let $\{M_t\}_{t=1}^T$ be a collection of Markov decision processes (MDPs) sharing state and action spaces $(\mathcal S, \mathcal A)$, dynamics $P$, and discount factor $\gamma$, but differing in task $t$ by the (unknown) total reward $r_t(s,a,s')$. MIRD frameworks posit that $r_t$ decomposes additively:

$$r_t(s,a,s') = \bar{r}_t(s,a,s') + r_{\mathrm{CS}}(s,a,s'),$$

where $\bar{r}_t$ is a simple, designed, task-specific reward and $r_{\mathrm{CS}}$ is a latent, shared "common-sense" or context-conditioned component (Glazer et al., 17 Feb 2024).

The explicit goal is to infer $r_{\mathrm{CS}}$ such that policies optimizing $\bar{r}_t + r_{\mathrm{CS}}$ robustly replicate the set of expert demonstrations across all $T$ tasks and generalize to novel task instances or transitions. This principle extends to settings where the task or context may be latent, multimodal, or inferred from unlabelled demonstrations (Lin et al., 27 May 2025).
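As a concrete illustration, a minimal Python sketch of the additive decomposition above (the reach-style task reward, the velocity/action-norm features, and the linear parameterization of $r_{\mathrm{CS}}$ are illustrative assumptions, not taken from the source papers):

```python
import numpy as np

def make_total_reward(task_reward, common_sense_reward):
    """Compose the per-task total reward r_t = r_bar_t + r_CS."""
    def r_total(s, a, s_next):
        return task_reward(s, a, s_next) + common_sense_reward(s, a, s_next)
    return r_total

def task_reward_reach(s, a, s_next, goal=np.array([1.0, 0.0])):
    # Simple, designed, task-specific term (distance of next state to a task goal).
    return -float(np.linalg.norm(s_next[:2] - goal))

theta = np.array([-0.1, -0.01])  # shared parameters, inferred jointly across tasks

def common_sense_reward(s, a, s_next):
    # Shared term over generic features, e.g. velocity magnitude and action norm.
    feats = np.array([abs(float(s_next[2])), float(np.dot(a, a))])
    return float(theta @ feats)

r_t = make_total_reward(task_reward_reach, common_sense_reward)
```

Only the shared parameters ($\theta$ in this sketch) are learned; the $\bar r_t$ terms are supplied by the designer for each task.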

2. Methodological Frameworks

MIRD methods span several algorithmic instantiations:

a. Multi-Task Adversarial IRL

An AIRL-like setup parameterizes $r_{\mathrm{CS}} = f_\theta(s,a,s')$ with parameters shared across tasks and constructs per-task discriminators

$$D^t_\theta(s,a,s') = \frac{\exp\big(f_\theta(s,a,s') + \bar r_t(s,a,s')\big)}{\exp\big(f_\theta(s,a,s') + \bar r_t(s,a,s')\big) + \pi_{w_t}(a \mid s)},$$

which guide alternating updates of $\theta$ and the per-task policies $\pi_{w_t}$ (Glazer et al., 17 Feb 2024). The shared $f_\theta$ is forced by joint optimization to encode only those reward modalities aligned across all tasks.
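In log-odds form, $D^t_\theta$ reduces to a sigmoid of $f_\theta + \bar r_t - \log \pi_{w_t}$, which is how it is typically computed in practice. A minimal sketch (assuming the three quantities are already evaluated per transition; the function names are illustrative):

```python
import numpy as np

def discriminator_logit(f_theta_val, r_bar_t_val, log_pi_val):
    # log D - log(1 - D) = f_theta(s,a,s') + r_bar_t(s,a,s') - log pi_{w_t}(a|s)
    return f_theta_val + r_bar_t_val - log_pi_val

def discriminator_prob(f_theta_val, r_bar_t_val, log_pi_val):
    # D^t_theta(s,a,s') from the definition above, via a sigmoid of the logit.
    logit = discriminator_logit(f_theta_val, r_bar_t_val, log_pi_val)
    return 1.0 / (1.0 + np.exp(-logit))
```

As in standard AIRL, the discriminator is trained with a binary cross-entropy objective separating expert from policy transitions in each task, and the same logit doubles as the reward signal for updating the per-task policies.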

b. Variational and Empowerment-regularized MIRD

Some MIRD variants, such as SEAIRL, augment adversarial IRL with situational empowerment, introducing an auxiliary mutual information term $I(a; s' \mid s, t)$ to disentangle reward from transition dynamics and encourage sub-task intent separation. The variational lower bound uses inverse models and potential functions, and the complete objective couples a task-conditioned reward function $R_\theta(s,a,t)$, a hierarchical policy $\pi_\phi(a \mid s, t)$, and an empowerment potential $\Phi_\varphi(s,t)$ in a joint actor–critic/generative adversarial architecture (Yoo et al., 2022).
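The mutual-information term is usually handled through a variational lower bound; a standard empowerment-style bound with a learned inverse model $q_\xi$ (stated generically here, not as the exact SEAIRL objective) is

$$I(a; s' \mid s, t) \;\ge\; \mathbb{E}_{\pi_\phi,\,P}\big[\log q_\xi(a \mid s, s', t) - \log \pi_\phi(a \mid s, t)\big],$$

where the expectation is over the policy and the dynamics. Maximizing the bound favors policies whose actions remain recoverable from the transitions they induce, which is what separates intent from dynamics.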

c. Multi-Source Reward Fusion

Another axis involves the probabilistic combination of reward functions learned from disparate sources (e.g., IRL, language, IRD). Here, MIRD constructs a posterior over reward parameters $\theta$ given two vectors $\phi^1, \phi^2$:

$$p(\theta \mid \phi^1, \phi^2) \propto p(\theta)\, p(\phi^1 \mid \theta)\, p(\phi^2 \mid \theta),$$

but instead of operating in reward space (which is fragile to likelihood misspecification), it works in behavior space by synthesizing trajectories via stochastic mixing between sources and then re-fitting reward hypotheses using maximum-entropy IRL on the mixed data (Krasheninnikov et al., 2021).
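A minimal sketch of this behavior-space construction (Python; the rollout and MaxEnt-IRL routines are injected callables rather than APIs from the source paper, and the 50/50 mixing probability reflects the symmetric-prior case):

```python
import numpy as np

def sample_mixed_behavior(rollout_1, rollout_2, n_traj, rng, p_mix=0.5):
    """Behavior-space mixing: for each trajectory, a coin flip decides which
    source's (near-)optimal policy generates it."""
    return [rollout_1() if rng.random() < p_mix else rollout_2()
            for _ in range(n_traj)]

def mird_posterior_sample(rollout_1, rollout_2, fit_maxent_irl, n_traj=32, seed=0):
    """One sample of a reward hypothesis: synthesize mixed trajectories, then
    re-fit a reward on them with a user-supplied maximum-entropy IRL routine."""
    rng = np.random.default_rng(seed)
    mixed = sample_mixed_behavior(rollout_1, rollout_2, n_traj, rng)
    return fit_maxent_irl(mixed)
```

Repeating this procedure yields an ensemble of reward hypotheses whose induced behaviors interpolate between the two sources, matching the convex-behavior property discussed in Section 3.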

d. Latent-Context Multi-Task IRL

For process control and related domains, a latent variable $z$ indexes tasks or modes, and both the reward $r_\theta(x,u,z)$ and the policy $\pi(u \mid x, z)$ are conditioned on $z$. Mutual information objectives (cf. InfoGAN) and encoder networks $q_\psi(z \mid \tau)$ infer modes from demonstration data, enabling mode-specific controller and reward learning without explicit mode labels (Lin et al., 27 May 2025).
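A minimal sketch of the mode-inference component (PyTorch-style, assuming a categorical latent $z$ and a flattened trajectory input; the architecture is illustrative, not the one used in the source paper):

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """q_psi(z | tau): infers a categorical mode z from a trajectory embedding."""
    def __init__(self, traj_dim, n_modes, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_modes),
        )

    def forward(self, tau):
        # Returns log q_psi(z | tau) over the n_modes candidate modes.
        return torch.log_softmax(self.net(tau), dim=-1)

def mutual_information_bonus(log_q_z_given_tau, z):
    # InfoGAN-style lower bound on I(z; tau): the policy is rewarded when the
    # sampled mode z can be recovered from the trajectory it produced.
    return log_q_z_given_tau.gather(-1, z.unsqueeze(-1)).squeeze(-1)
```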

3. Theoretical Insights and Identifiability

A foundational theoretical observation is that reward learning from a single task is not identifiable: any reward related to the true one by a potential-based shaping transformation explains the same policy. MIRD achieves identifiability by enforcing that only modalities consistent across all tasks persist in $r_{\mathrm{CS}}$. If a feature of the state affects behavior in only one task, joint training drives its coefficient in $r_{\mathrm{CS}}$ to zero (Glazer et al., 17 Feb 2024).
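Concretely, for any potential function $\Phi : \mathcal S \to \mathbb R$, the shaped reward

$$r'(s,a,s') = r(s,a,s') + \gamma\,\Phi(s') - \Phi(s)$$

induces the same optimal policies as $r$, so single-task demonstrations cannot distinguish the two; a shared $r_{\mathrm{CS}}$, by contrast, must explain expert behavior under every $\bar r_t$ simultaneously.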

For the multi-source reward setting, MIRD ensures:

  • Convex-behavior property: Posterior samples yield behaviors that interpolate (convexly) between those preferred by each input source.
  • Strong informativeness: When input behaviors coincide, the posterior is informative about the common feature expectation.
  • Behavior-space balance: With symmetric priors, behaviors are balanced between sources (Krasheninnikov et al., 2021).

4. Algorithmic Implementations

High-level pseudocode for MIRD frameworks follows an alternating minimax structure:

  1. For each task, collect expert demonstrations $\mathcal D_t$ and the known task reward $\bar r_t$.
  2. Alternate updates:
    • Discriminator/reward function: optimize parameters (e.g., $\theta$ in AIRL, $\xi$ in variational IRL) to better separate expert and agent trajectories across all tasks or modes.
    • Policy (per task/mode): optimize to maximize the current combined reward estimate, possibly weighted or regularized by empowerment, mutual information, or behavior constraints.
    • For latent-variable frameworks: update context encoders (e.g., $q_\psi$) to maximize mode distinguishability (Lin et al., 27 May 2025).
  3. Repeat until convergence; deploy the shared reward $f_\theta$, the contextually tuned policy, or the posterior ensemble as needed (Glazer et al., 17 Feb 2024, Yoo et al., 2022, Krasheninnikov et al., 2021, Lin et al., 27 May 2025).

Explicit pseudocode and architectural details for each main algorithm are given in the source papers.
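A compact sketch of this alternating loop (framework-agnostic Python; the update routines are injected callables, since each instantiation in Section 2 defines its own losses):

```python
def mird_training_loop(tasks, n_iterations,
                       update_reward, update_policy, update_encoder=None):
    """Alternating minimax structure shared by the MIRD instantiations above.

    `tasks` maps a task id to its expert demos D_t, known reward r_bar_t, and
    current policy; the update callables stand in for the framework-specific
    steps (AIRL-, variational-, or latent-context-style).
    """
    for _ in range(n_iterations):
        # (i) shared reward / discriminator step across all tasks or modes
        update_reward(tasks)
        # (ii) per-task (or per-mode) policy improvement on r_bar_t + r_CS
        for task_id, task in tasks.items():
            update_policy(task_id, task)
        # (iii) optional latent-context step (e.g. the encoder q_psi)
        if update_encoder is not None:
            update_encoder(tasks)
    return tasks
```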

5. Empirical Results and Applications

MIRD methods are evaluated on robotics, process control, and synthetic environments, with a consistent theme: multi-task structure enables recovery of reward functions (or ensembles) with superior transferability and behavioral robustness.

a. Robotics Benchmarks

In Meta-World (Sawyer manipulation tasks), MIRD recovers common-sense rewards for velocity or action-norm targets that transfer (>90% alignment to target) to unseen tasks, outperforming single-task IRL and multi-task AIRL without explicit disentanglement:

  • Correlation (held-out tasks): 0.88 (velocity), 0.81 (action-norm) (Glazer et al., 17 Feb 2024).
  • Success rate (MT10, ML5): MIRD exhibits significantly higher final and sample-efficient performance than GAIL, DIGAIL, and EAIRL (Yoo et al., 2022).

b. Multi-Source Reward Tradeoff

In a toy “Cooking” environment, MIRD and its independent-features variant MIRD-IF maintain optimal behavioral tradeoffs, balance, and informativeness between conflicting reward sources. MIRD-IF additionally satisfies support on independently corrupted features, while vanilla MIRD is maximally informative when behaviors coincide (Krasheninnikov et al., 2021).

c. Multi-Mode Process Control

In continuous stirred tank reactor (CSTR) and fed-batch bioreactor studies, MIRD recovers context-conditioned rewards and policies from unlabelled, multi-mode demonstrations. Learned controllers match expert batch product targets and setpoint tracking error within a few percent, handling both DRL and traditional PI controller experts (Lin et al., 27 May 2025).

6. Limitations and Extensions

Key limitations include the reliance on an additive decomposition with designer-specified task rewards $\bar r_t$, the need for expert demonstrations spanning multiple related tasks or modes, and the dependence of identifiability (Section 3) on sufficient task diversity.

Ongoing and prospective extensions include:

  • Hierarchical latent structure for sub-task/sub-mode inference.
  • Integration of language or side information for richer task reward shaping (Glazer et al., 17 Feb 2024).
  • Meta-IRL and continuous context MIRD for broad task families (Yoo et al., 2022).
  • Partially observed or multi-agent extensions (e.g., RNN encoders, shared global context) (Lin et al., 27 May 2025).

7. Significance and Research Directions

MIRD provides a principled route to isolating transferable, robust reward components from multi-task data or complex, multimodal sources. It is positioned as a scalable path toward safe and generalizable agent specification and planning, mitigating reward misspecification, and enabling sample-efficient transfer in high-dimensional real-world settings. Active research explores generalization, theoretical guarantees, and broader integration with other specification frameworks.

