Convergence of nested MDP policy to I-POMDP policy as sensing and actuation accuracy improves

Prove that, as the observation and transition models become more accurate (i.e., as the parameter $\epsilon$ decreases), the level-0 nested MDP policy of the other agent converges to, or closely approximates, the exact level-0 I-POMDP policy, thereby formally establishing nested MDP as an effective surrogate for I-POMDP under fine sensing and actuation capabilities.
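One way to state the target result precisely (the notation below is assumed for illustration, not taken from the paper): writing $\hat{\pi}^{0}_{-t,\epsilon}$ for the level-0 nested MDP policy and $\pi^{0,*}_{-t,\epsilon}$ for the exact level-0 I-POMDP policy of the other agent at noise level $\epsilon$, one would show either exact agreement of the two policies for all sufficiently small $\epsilon$, or the weaker claim of vanishing value loss,

\[
\lim_{\epsilon \to 0} \Bigl( V^{*}_{-t,\epsilon}(b) \;-\; V^{\hat{\pi}^{0}_{-t,\epsilon}}_{-t,\epsilon}(b) \Bigr) \;=\; 0 \quad \text{for every belief } b,
\]

where $V^{*}_{-t,\epsilon}$ denotes the other agent's optimal I-POMDP value function and $V^{\hat{\pi}}_{-t,\epsilon}$ the value of executing the nested MDP policy $\hat{\pi}$ in the I-POMDP.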

Background

The paper presents Theorem 1, which bounds the difference in $n$-step Q-values between the other agent's MDP (nested MDP at level 0) and its POMDP (I-POMDP at level 0) under high sensing and actuation accuracy. Building on this quantitative bound, the authors conjecture a stronger policy-level approximation as sensing and actuation improve.
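For orientation, the bound has the following schematic shape (the abbreviation $\delta(\epsilon, n)$ is introduced here; the exact constants are those of the paper):

\[
\bigl|\, Q^{n}_{\mathrm{MDP}}(s, v) \;-\; Q^{n}_{\mathrm{POMDP}}(b, v) \,\bigr| \;\le\; \delta(\epsilon, n),
\qquad \delta(\epsilon, n) \longrightarrow 0 \ \text{as } \epsilon \to 0,
\]

where $Q^{n}_{\mathrm{MDP}}$ and $Q^{n}_{\mathrm{POMDP}}$ denote the other agent's $n$-step Q-values under the nested MDP and the level-0 I-POMDP, respectively, and $v$ ranges over its actions.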

Establishing policy convergence would justify the core structural assumption of I-POMDP Lite—that the other agent’s intention is driven by nested MDP—and provide theoretical foundations for using nested MDP as a surrogate for the exact I-POMDP policy in practical multi-agent planning.

References

Following from Theorem 1, we conjecture that, as $\epsilon$ decreases (i.e., observation and transition models become more accurate), the nested MDP policy $\pi^{0}_{-t}$ of the other agent is more likely to approximate the exact I-POMDP policy closely.
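A standard route from the Q-value bound to policy agreement, sketched here as a plausible argument rather than a result from the paper: whenever the optimal action's advantage over the runner-up exceeds twice the approximation error, the greedy policies induced by the two Q-functions coincide. Concretely, with $\delta(\epsilon, n)$ as above and $v^{*}$ the I-POMDP-optimal action at belief $b$,

\[
\min_{v \neq v^{*}} \Bigl( Q^{n}_{\mathrm{POMDP}}(b, v^{*}) - Q^{n}_{\mathrm{POMDP}}(b, v) \Bigr) \;>\; 2\,\delta(\epsilon, n)
\;\;\Longrightarrow\;\;
\arg\max_{v} Q^{n}_{\mathrm{MDP}}(s, v) \;=\; \{\, v^{*} \,\},
\]

since $Q^{n}_{\mathrm{MDP}}(s, v^{*}) \ge Q^{n}_{\mathrm{POMDP}}(b, v^{*}) - \delta(\epsilon, n) > Q^{n}_{\mathrm{POMDP}}(b, v) + \delta(\epsilon, n) \ge Q^{n}_{\mathrm{MDP}}(s, v)$ for all $v \neq v^{*}$. Thus exact policy agreement would hold at every belief whose action gap dominates $2\,\delta(\epsilon, n)$; the open difficulty concerns beliefs with vanishing action gaps, where only near-optimality of the nested MDP policy, rather than exact agreement, can be expected.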