Extend continuous-state stochastic identifiability beyond unconditional policies

Establish whether the identifiability guarantees that uniquely recover the transition kernel P from the tuple (Q^π, π, r, γ) in stochastic Markov decision processes (MDPs) with continuous state spaces continue to hold when the policy π is goal-conditioned (i.e., depends on the goal) rather than unconditional. Specifically, prove that P is uniquely determined by the Q-values of goal-conditioned policies under the Gaussian-goal and indicator-goal settings studied in this paper, thereby generalising the unconditional-policy results for stochastic continuous-state MDPs.

Background

The paper introduces P-learning, a procedure for extracting a world model by inverting the Bellman equation using an agent’s learned Q-values, and proves sufficient conditions for unique identifiability of the transition kernel P from (Q, π, r, γ).
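To make the idea of inverting the Bellman equation concrete, the following is a minimal sketch in a finite (tabular) toy setting, not the paper's P-learning procedure and not the continuous-state case the problem concerns. It assumes the standard evaluation equation Q_g(s,a) = r_g(s,a) + γ Σ_s' P(s'|s,a) V_g(s') with V_g(s') = Σ_a' π(a'|s',g) Q_g(s',a'), and an indicator-style reward r_g(s,a) = 1[s = g]. The sizes, the random kernel, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, G, gamma = 4, 2, 4, 0.9  # toy sizes (assumed); one goal per state

# Ground-truth kernel P[s, a, s'] and goal-conditioned policy pi[g, s, a].
P = rng.dirichlet(np.ones(S), size=(S, A))
pi = rng.dirichlet(np.ones(A), size=(G, S))

# Indicator-style reward (assumed form): r[g, s, a] = 1 if s == g else 0.
r = np.zeros((G, S, A))
for g in range(G):
    r[g, g, :] = 1.0

# Policy evaluation: for each goal, solve (I - gamma * M_g) Q_g = r_g, where
# M_g[(s, a), (s', a')] = P[s, a, s'] * pi[g, s', a'].
Q = np.zeros((G, S, A))
for g in range(G):
    M = (P[:, :, :, None] * pi[g][None, None, :, :]).reshape(S * A, S * A)
    Q[g] = np.linalg.solve(np.eye(S * A) - gamma * M, r[g].reshape(-1)).reshape(S, A)

# State values under the goal-conditioned policy: V[g, s'].
V = (pi * Q).sum(axis=2)

# Inversion: for every (s, a), the row P[s, a, :] solves the linear system
#   V @ P[s, a, :] = (Q[:, s, a] - r[:, s, a]) / gamma.
targets = ((Q - r) / gamma).reshape(G, S * A)
X, *_ = np.linalg.lstsq(V, targets, rcond=None)   # X[s', (s, a)]
P_hat = X.reshape(S, S, A).transpose(1, 2, 0)     # back to [s, a, s']

print("rank of V:", np.linalg.matrix_rank(V))
print("max |P_hat - P|:", np.abs(P_hat - P).max())
```

In this toy, P is recovered exactly whenever the matrix V has full column rank. With an unconditional policy and this reward, the columns V_g form the resolvent (I - γ P_π)^{-1}, which is always invertible; with a goal-conditioned policy each V_g comes from a different resolvent, so invertibility is no longer automatic and is only checked numerically here.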

For continuous state spaces, the authors establish identifiability results under Gaussian and indicator goal families in deterministic MDPs and, in the stochastic case, under the restriction that policies are unconditional (i.e., independent of the goal).

They explicitly note that the combination of stochastic kernels and general (goal-conditioned) policies is not addressed by their current theory. They expect, but have not proved, that their stochastic continuous-state results generalise to goal-conditioned policies. Formalising and proving this generalisation therefore remains an open problem.
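One way to write down what such a proof would need to show is sketched below. The notation is an assumption for illustration (a goal-indexed reward r_g, state space \mathcal{S}, action space \mathcal{A}) and may differ from the paper's; the first line is the standard Bellman evaluation equation, and the second states uniqueness as a separation property of the value functions.

```latex
% Bellman evaluation equation for a goal-conditioned policy (assumed standard form)
Q^{\pi}_{g}(s,a) = r_{g}(s,a) + \gamma \int_{\mathcal{S}} V^{\pi}_{g}(s')\, P(\mathrm{d}s' \mid s,a),
\qquad
V^{\pi}_{g}(s') = \int_{\mathcal{A}} Q^{\pi}_{g}(s',a')\, \pi(\mathrm{d}a' \mid s', g).

% If kernels P and P' produce the same Q^{\pi} (with the same \pi, r, \gamma),
% they share the same V^{\pi}_{g}, so uniqueness amounts to: for every (s,a),
\Bigl[\, \forall g:\ \int_{\mathcal{S}} V^{\pi}_{g}(s')\, \bigl(P - P'\bigr)(\mathrm{d}s' \mid s,a) = 0 \,\Bigr]
\;\Longrightarrow\;
P(\cdot \mid s,a) = P'(\cdot \mid s,a).
```

Under this reading, the open question is whether the family of value functions V^{\pi}_{g}, indexed by g, still separates probability measures on \mathcal{S} once \pi is allowed to depend on g.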

References

First, our theoretical results for stochastic MDPs with continuous state spaces only apply to unconditional policies, and while we expect them to generalise, we leave them as open conjectures for future work.

Inverting the Bellman Equation: From $Q$-Values to World Models (2606.21173 - Letcher et al., 19 Jun 2026) in Limitations, Section Conclusion