Extend continuous-state stochastic identifiability beyond unconditional policies
Establish whether the identifiability guarantees that uniquely recover the transition kernel P from the tuple (Q^π, π, r, γ) in stochastic Markov decision processes with continuous state spaces continue to hold when the policy π is goal-conditioned (i.e., depends on the goal), rather than being unconditional. Specifically, prove uniqueness of P from Q-values for goal-conditioned policies under the Gaussian-goal and indicator-goal settings studied in this paper, thereby generalising the unconditional-policy results for stochastic continuous MDPs.
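To make the target statement concrete, here is one way the question could be written down. It is a sketch, not the paper's own formulation. It assumes the standard Bellman consistency equation, a goal variable g, and a goal-dependent reward r(s, a, g). The symbols g, V^π and \tilde{P} are notation introduced here for illustration and do not appear in the source.

```latex
% Assumed goal-conditioned Bellman consistency (standard form; g is hypothetical notation)
\[
Q^{\pi}(s,a,g) \;=\; r(s,a,g) \;+\; \gamma \int P(\mathrm{d}s' \mid s,a)\, V^{\pi}(s',g),
\qquad
V^{\pi}(s',g) \;:=\; \int \pi(\mathrm{d}a' \mid s',g)\, Q^{\pi}(s',a',g).
\]

% Identifiability: two kernels consistent with the same (Q^\pi, \pi, r, \gamma) must coincide.
% Subtracting the two Bellman equations and dividing by \gamma > 0 gives the condition on the left.
\[
\int \bigl(P - \tilde{P}\bigr)(\mathrm{d}s' \mid s,a)\, V^{\pi}(s',g) \;=\; 0
\quad \text{for all } (s,a,g)
\;\;\Longrightarrow\;\;
P(\cdot \mid s,a) \;=\; \tilde{P}(\cdot \mid s,a)
\quad \text{for all } (s,a).
\]
```

Read this way, uniqueness of P amounts to asking whether the family of functions V^π(·, g), indexed by g, is rich enough to separate probability measures over next states. The open question is whether this holds in the Gaussian-goal and indicator-goal settings once π itself depends on g.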
References
First, our theoretical results for stochastic MDPs with continuous state spaces only apply to unconditional policies, and while we expect them to generalise, we leave them as open conjectures for future work.