Incorporating state-visitation into the zooming dimension
Develop a formal, instance-dependent complexity measure for episodic finite-horizon Markov decision processes with continuous state and action spaces under Lipschitz assumptions. The measure should modify the step-wise zooming dimension $z_h$, currently defined via the covering number $N_r(Z_h^r)$ of the near-optimal set $Z_h^r = \{(x,a) \in \mathcal{S} \times \mathcal{A} : \mathrm{gap}_h(x,a) \le C(H+1)r\}$, so that it explicitly accounts for the distribution of states visited by the optimal policy. Establish a rigorous definition and analysis framework that reflects the empirical observation that adaptive discretization refines only those regions of the state space that the optimal policy visits, and thereby captures potential improvements in the effective state-space dimension.
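To make the quantities concrete, the following sketch estimates the step-wise zooming dimension on a finite discretization of $\mathcal{S} \times \mathcal{A}$: it builds the near-optimal set $Z_h^r$, greedily upper-bounds its $r$-covering number $N_r$, and returns the smallest exponent $d$ with $N_r(Z_h^r) \le C r^{-d}$ over the tested radii. The grid, the gap function, and the greedy cover are illustrative assumptions, not part of any stated algorithm.

```python
import numpy as np

def covering_number(points, r):
    """Greedily upper-bound the r-covering number of a finite point set
    under the sup-norm (greedy centers suffice for a sketch)."""
    remaining = list(points)
    centers = 0
    while remaining:
        c = np.asarray(remaining[0])
        remaining = [p for p in remaining
                     if np.max(np.abs(np.asarray(p) - c)) > r]
        centers += 1
    return centers

def near_optimal_set(grid, gap, r, C=1.0, H=1):
    """Z_h^r: grid points (x, a) with gap_h(x, a) <= C (H + 1) r."""
    return [p for p in grid if gap(p) <= C * (H + 1) * r]

def zooming_dimension(grid, gap, radii, C=1.0, H=1):
    """Smallest d with N_r(Z_h^r) <= C r^{-d}, maximized over the radii."""
    best = 0.0
    for r in radii:  # radii must lie in (0, 1)
        N = covering_number(near_optimal_set(grid, gap, r, C, H), r)
        if N > 0:
            best = max(best, np.log(max(N / C, 1.0)) / np.log(1.0 / r))
    return best
```

On a toy two-dimensional instance whose gap depends only on the action, `gap(p) = |a - 0.5|`, the near-optimal set is a thin strip, so the estimated zooming dimension drops below the ambient dimension; a visitation-weighted variant would additionally restrict `near_optimal_set` to states the optimal policy reaches with non-negligible probability.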
References
Even in the simpler contextual multi-armed bandit model, the zooming dimension necessarily scales with the dimension of the context space, regardless of the support or mass the context distribution places over it. While analytically we cannot show gains in the state-space dimension, we see empirically in \cref{sec:experiments} that the algorithms only refine their cover of the state space in regions the optimal policy visits; it is unclear, however, how to capture this intuition formally in the definition. Developing new notions of ``instance-specific'' complexity is an interesting direction for future work in both tabular and continuous RL.
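The empirical observation above can be illustrated with a minimal adaptive-discretization sketch: a binary partition tree over a one-dimensional state space that splits a leaf once its visit count exceeds the reciprocal of its width. The splitting rule and the visitation distribution (states concentrated near $0.25$, standing in for an optimal policy's occupancy measure) are hypothetical choices for illustration, not the refinement rule of any specific algorithm.

```python
import random

class Cell:
    """Node of a binary partition tree over an interval of the state space."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.visits = 0
        self.children = None

    def contains(self, x):
        return self.lo <= x < self.hi

    def width(self):
        return self.hi - self.lo

def locate(root, x):
    """Return the leaf cell whose interval contains state x."""
    cell = root
    while cell.children is not None:
        cell = next(c for c in cell.children if c.contains(x))
    return cell

def visit(root, x):
    """Record a visit to x; split the leaf once visits reach 1 / width.

    This threshold is a hypothetical stand-in for the count-based
    refinement rules used by adaptive-discretization algorithms.
    """
    leaf = locate(root, x)
    leaf.visits += 1
    if leaf.visits >= 1.0 / leaf.width():
        mid = (leaf.lo + leaf.hi) / 2.0
        leaf.children = [Cell(leaf.lo, mid), Cell(mid, leaf.hi)]

random.seed(0)
root = Cell(0.0, 1.0)
for _ in range(500):
    # states drawn near 0.25, mimicking a policy's visitation distribution
    x = min(max(random.gauss(0.25, 0.05), 0.0), 0.999)
    visit(root, x)
```

After the rollout, leaves near the visited region are fine while leaves the policy never reaches stay coarse, which is exactly the behavior a visitation-aware zooming dimension would need to reward.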