Multitask Inverse Reward Design (MIRD)
- Multitask Inverse Reward Design (MIRD) is a framework that aggregates reward signals from multiple sources to infer robust and transferable reward functions in reinforcement learning.
- It employs behavioral mixtures with theoretical guarantees to balance informativeness and safety in policy planning under reward misspecification.
- Extensions of MIRD support reward decomposition and transferable learning across tasks, enhancing generalization and mitigating reward hacking challenges.
Multitask Inverse Reward Design (MIRD) is a suite of algorithmic and theoretical frameworks for aggregating reward information from multiple sources or tasks under uncertainty, typically in the context of reinforcement learning (RL) when the reward function is underspecified or potentially misspecified. MIRD addresses the challenge of obtaining reward functions that are robust to conflicting specifications, transferable across tasks, and informative for safe and effective planning. Its main methodological contributions center on combining evidence over reward parameters at the behavioral, rather than parametric, level and providing actionable guarantees about regret and support within the induced posterior over rewards.
1. Formal Definition and Problem Setting
Multitask Inverse Reward Design operates in the Markov Decision Process (MDP) framework. Let $\mathcal{S}$ be the state space, $\mathcal{A}$ the action space, $T$ the transition kernel, and $\gamma$ the temporal discount factor. The central premise is that the true reward function is latent, with only indirect evidence available through expert demonstrations, language specifications, prior rewards, or other perturbable proxies. In the standard setting, the reward is assumed linear in features: $r(s) = \theta^\top \phi(s)$ for feature map $\phi$ and parameter vector $\theta$.
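As a concrete illustration of the linear-reward assumption, the following minimal sketch evaluates $r(s) = \theta^\top \phi(s)$ for a hypothetical feature map; the names `phi` and `reward` and the numeric values are illustrative only, not part of any reference implementation.

```python
# Minimal sketch of the linear-reward assumption r(s) = theta^T phi(s).
# `phi`, `reward`, and the numbers below are illustrative placeholders.
import numpy as np

def phi(state: np.ndarray) -> np.ndarray:
    """Hypothetical feature map: raw state plus a constant bias feature."""
    return np.concatenate([state, [1.0]])

def reward(state: np.ndarray, theta: np.ndarray) -> float:
    """Reward linear in features, as in the standard IRD setting."""
    return float(theta @ phi(state))

theta = np.array([1.0, -0.5, 0.1])   # one reward hypothesis
print(reward(np.array([0.3, 0.7]), theta))
```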
MIRD generalizes classic Inverse Reward Design (IRD) to settings in which:
- There are multiple sources of reward information, each potentially misspecified or in partial conflict.
- The agent must plan under a posterior induced not by one but by all reward sources, which can include hand-designed reward functions, observed demonstrations, or auxiliary inputs.
- The desired posterior should embody robustness (preserving plausible reward interpretations under model corruption), informativeness (admitting desired behaviors when sources agree), and balance (avoiding overcommitment when sources conflict) (Krasheninnikov et al., 2021).
Complementary variants of MIRD operate when the evidence consists of expert demonstrations across different tasks, aiming to infer either a decomposed structure (e.g., common-sense vs. task-specific reward) (Glazer et al., 17 Feb 2024), or a latent multi-task/class reward suitable for transfer and robust generalization (Yoo et al., 2022).
2. Desiderata for Combining Reward Evidence
MIRD is motivated by four central desiderata for reward aggregation:
- Robust Support on Feature Grid: The posterior should include all combinations of per-feature reward weights from the input sources, supporting settings in which features may be corrupted independently.
- Support on Intermediate Tradeoffs: If sources disagree on the ratio of two feature weights, the posterior should admit all intermediate ratios, enabling learning of plausible tradeoffs.
- Informativeness for Desirable Behavior: When two sources yield the same optimal behavior in the training environment, the posterior should concentrate on reward parameters that reproduce that behavior. MIRD formalizes this as strong/medium/weak informativeness, depending on whether the match is elementwise, global, or holds over a specific subset.
- Behavior-space Balance: When sources induce conflicting policies, the posterior over feature expectations $\mu(\theta)$ (for the policy optimal under $\theta$; see the sketch at the end of this section) should be balanced, preserving diversity and avoiding overcommitment to any single specification (Krasheninnikov et al., 2021).
These desiderata explicitly delineate the tradeoffs—conservatism in the face of conflict versus informativeness when models agree—that MIRD aims to balance, especially crucial when downstream planning uses policy regularization methods such as Attainable Utility Preservation.
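Several of these desiderata are stated in behavior space, i.e., in terms of discounted feature expectations $\mu(\pi) = \mathbb{E}\!\left[\sum_t \gamma^t \phi(s_t)\right]$. Below is a minimal sketch of estimating this quantity by Monte Carlo rollouts; `env`, `policy`, and `phi` are hypothetical stand-ins with a gym-like interface, not objects defined by MIRD itself.

```python
# Illustrative sketch: Monte Carlo estimate of discounted feature expectations
# mu(pi) = E[sum_t gamma^t phi(s_t)], the behavior-space quantity referenced by
# the informativeness and balance desiderata. `env` and `policy` are assumed to
# expose a gym-like reset/step interface and are placeholders.
import numpy as np

def feature_expectations(env, policy, phi, gamma=0.99, n_rollouts=100, horizon=50):
    total = None
    for _ in range(n_rollouts):
        s = env.reset()
        discount = 1.0
        for _ in range(horizon):
            f = discount * phi(s)
            total = f if total is None else total + f
            s, _, done, _ = env.step(policy(s))
            discount *= gamma
            if done:
                break
    # Average of per-rollout discounted feature sums.
    return total / n_rollouts
```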
3. Algorithmic Construction and Theoretical Guarantees
MIRD employs a generative mixture-model approach in behavior space:
- Generative Process:
- Sample a mixing coefficient $b \sim \mathrm{Beta}(\alpha, \alpha)$.
- Generate a synthetic demonstration set by sampling each trajectory from the optimal/soft policy for either source $1$ (with probability $b$) or source $2$ (with probability $1-b$).
- Recover $\tilde\theta$ by solving Maximum Causal Entropy IRL on this mixed demonstration set, ensuring the induced policy's feature expectations are an exact convex combination $\mu(\tilde\theta) = b\,\mu_1 + (1-b)\,\mu_2$, where $\mu_1, \mu_2$ are the feature expectations of the soft-optimal policies from each source (a minimal sketch of this generative process follows the list of guarantees below).
- Posterior Characterization: the MIRD posterior over reward parameters is defined implicitly by this generative process, i.e., as the distribution of the recovered $\tilde\theta$; samples are obtained by repeating the mixing-and-IRL procedure.
- Theoretical Guarantees:
- Support: Every $\tilde\theta$ in the support of the posterior yields policy feature expectations on the line segment between $\mu_1$ and $\mu_2$.
- Worst-Case Regret Bound: for any true reward parameter $\theta^*$, the regret of planning with a posterior sample is bounded by the larger of the regrets of the two input sources, $\mathrm{Regret}(\tilde\theta; \theta^*) \le \max\{\mathrm{Regret}(\theta_1; \theta^*),\, \mathrm{Regret}(\theta_2; \theta^*)\}$, ensuring that, in the worst case, the planner is no worse than choosing the worst source (Krasheninnikov et al., 2021).
- Balance and Informativeness: For symmetric Beta priors ($b \sim \mathrm{Beta}(\alpha, \alpha)$), the posterior exactly balances between the two behavior modes, and, when the feature expectations coincide, the support is confined to reward parameters that yield that shared behavior.
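The generative process above can be summarized in code. The sketch below is a simplified rendering under the linear-reward assumption: the Maximum Causal Entropy IRL step is abstracted behind a placeholder callable, and all names and signatures are illustrative rather than taken from the original implementation.

```python
# Hedged sketch of the MIRD generative process. `max_causal_ent_irl` is a
# placeholder for an IRL solver that returns a reward parameter whose
# soft-optimal policy matches the feature expectations of the mixed
# demonstrations; this matching is what yields
# mu(theta_tilde) = b * mu_1 + (1 - b) * mu_2.
import numpy as np

def mird_posterior_samples(demos_1, demos_2, max_causal_ent_irl,
                           n_samples=100, n_traj=20, alpha=1.0, rng=None):
    """Draw reward parameters from the (implicit) MIRD posterior."""
    rng = rng or np.random.default_rng(0)
    samples = []
    for _ in range(n_samples):
        # 1. Sample a mixing coefficient b from a symmetric Beta prior.
        b = rng.beta(alpha, alpha)
        # 2. Build a synthetic demonstration set: each trajectory is drawn
        #    from source 1 with probability b, otherwise from source 2.
        mixed = [demos_1[rng.integers(len(demos_1))] if rng.random() < b
                 else demos_2[rng.integers(len(demos_2))]
                 for _ in range(n_traj)]
        # 3. Recover a reward parameter by (soft) IRL on the mixed set.
        samples.append(max_causal_ent_irl(mixed))
    return samples
```

Downstream, a conservative planner such as Attainable Utility Preservation can then be run against the empirical distribution of these samples.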
A key variant, MIRD-IF, employs a Dirichlet mixture over the entire feature grid for full per-feature corruption coverage (a hedged sketch follows the comparison table below).
| Variant | Support on full grid | Intermediate tradeoffs | Strong informativeness | Balance |
|---|---|---|---|---|
| MIRD | No | No | Yes | Yes |
| MIRD-IF | Yes | Yes | Partly | Yes |
| CC-Grid | Yes | Yes | No | No |
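For intuition about the MIRD-IF row of the table, the following hedged sketch shows one plausible reading of per-feature mixing: each feature dimension receives its own mixing weight, so a sampled parameter can combine per-feature weights from the two sources independently and thereby cover the full feature grid. Mixing reward parameters directly, as done here, is a simplification of the actual behavior-space construction.

```python
# Hedged sketch of per-feature mixing in the spirit of MIRD-IF (one plausible
# reading, not the reference algorithm): independent mixing weights per feature
# allow any combination of the two sources' per-feature weights.
import numpy as np

def mird_if_parameter_sample(theta_1, theta_2, alpha=1.0, rng=None):
    rng = rng or np.random.default_rng()
    d = len(theta_1)
    # Independent per-feature mixing weights (Beta is the two-source Dirichlet).
    b = rng.beta(alpha, alpha, size=d)
    return b * np.asarray(theta_1) + (1.0 - b) * np.asarray(theta_2)
```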
4. Extensions in Multi-task IRL and Transferable Reward Learning
Recent work generalizes MIRD to settings with multiple tasks or subtasks:
- Reward Decomposition (Glazer et al., 17 Feb 2024): The true reward decomposes as $R_t(s,a,s') = R_{\text{task},t}(s,a,s') + R_{\text{common}}(s,a,s')$, with $R_{\text{task},t}$ known and $R_{\text{common}}$ latent and shared across tasks. Using multi-task AIRL-style discriminators with shared parameters, the framework alternates between discriminator and policy updates, optimizing a total loss over tasks. This disentangles "what" the agent should do (task) from "how" (common sense), addresses reward hacking, and achieves transfer of $R_{\text{common}}$ across unseen tasks (a schematic sketch follows this list).
- Variational Empowerment-based Multitask IRL (Yoo et al., 2022): Here, a latent code indexes subtasks; the objective includes a variational lower bound on the "situational empowerment" (a mutual-information objective), ensuring the learned reward captures reusable, robust sub-policy components transferable to new dynamics or tasks. Empirical results on diverse benchmarks show improved transfer and robustness relative to baselines.
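As a schematic illustration of the decomposition $R_t = R_{\text{task},t} + R_{\text{common}}$ used in the first extension above, the sketch below pairs known task-specific rewards with a single learned common-sense reward; the class and attribute names are hypothetical, and the AIRL-style discriminator and policy updates are omitted.

```python
# Schematic sketch of the decomposed reward: known per-task rewards plus a
# shared, learned common-sense reward. All names here are illustrative.
class DecomposedReward:
    def __init__(self, task_rewards, common_reward):
        self.task_rewards = task_rewards    # dict: task_id -> known reward fn
        self.common_reward = common_reward  # learned, shared across tasks

    def __call__(self, task_id, s, a, s_next):
        # Total reward for a transition under a given task.
        return (self.task_rewards[task_id](s, a, s_next)
                + self.common_reward(s, a, s_next))

# At transfer time, only `common_reward` is reused on an unseen task, paired
# with that task's known task-specific reward.
```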
5. Representative Empirical Evaluations
Empirical validation of MIRD and its variants focuses on robustness to misspecification, support diversity, informativeness, and safe planning:
- Toy gridworld (“Cooking” environment) (Krasheninnikov et al., 2021): MIRD and MIRD-IF are tested with Attainable Utility Preservation planning on feature counts (e.g., flour, dough, cake), evaluating support on the grid, informativeness when sources agree, and balancing behaviors when conflicted. MIRD achieves strong informativeness and balanced behavior-space support. MIRD-IF uniquely covers the full feature grid.
- Meta-World V2 manipulation tasks (Glazer et al., 17 Feb 2024): MIRD’s reward decomposition approach recovers transferable common-sense rewards (velocity, action-norm) that generalize to unseen tasks, outperforming single-task IRL both quantitatively (e.g., achieving 0.69 velocity ratio on new tasks vs. 0.32 for SAC) and in behavioral quality (avoiding task-specific reward hacking).
6. Practical Guidelines and Limitations
Guidance on use-cases includes:
- Use classic MIRD when robust informativeness and behavior-space balance are needed, even if support is restricted to convex combinations.
- Use MIRD-IF to cover all plausible per-feature combinations or when planning must account for independent feature corruptions or active reward queries.
- Avoid naive aggregate reward-space methods (additive, Gaussian) in high-conflict or high-misspecification scenarios.
Principal limitations:
- Full identification of latent reward structure requires diverse demonstrations or evidence; poor task diversity undermines identifiability (Glazer et al., 17 Feb 2024).
- MIRD presumes known feature maps and, for the decomposition approach, known task rewards.
- Empirical scaling to complex common-sense reward and richer environments remains an open challenge; current formal guarantees do not yield finite-sample rates.
7. Connections and Context
MIRD establishes foundational links between Inverse Reward Design, Inverse Reinforcement Learning, and safe RL planning under distributional and epistemic uncertainty. Its emphasis on behavioral mixtures and regret guarantees addresses central concerns in safe deployment of RL: reward misspecification, overfitting, and fragility under novel compositions of reward signal. Extensions incorporating latent task codes (Yoo et al., 2022) and reward disentanglement (Glazer et al., 17 Feb 2024) embed MIRD principles within the broader literature on multi-task and transferable RL, underscoring the versatility and theoretical soundness of behavior-centric aggregation of reward information.