Incorporating state-visitation into the zooming dimension

Develop a formal, instance-dependent complexity measure for episodic finite-horizon Markov decision processes with continuous state and action spaces under Lipschitz assumptions. The measure should modify the step-wise zooming dimension z_h, currently defined through the scaling of the packing numbers N_r(Z_h^r) of the near-optimal sets Z_h^r = {(x,a) ∈ S × A : gap_h(x,a) ≤ C(H+1) r}, so that it explicitly accounts for the distribution of states visited by the optimal policy. Establish a rigorous definition and analysis framework that reflects the empirical observation that adaptive discretization refines only those regions of the state space that the optimal policy visits, and that thereby captures potential improvements in the effective state-space dimension.
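
For reference, the quantities above can be written out as follows. This is a sketch following the notation of the problem statement and the paper's Subsection 2.4; the exact constants and the range of r follow the paper's conventions and may differ in detail.

\[
\mathrm{gap}_h(x,a) = V_h^\star(x) - Q_h^\star(x,a),
\qquad
Z_h^r = \{(x,a) \in \mathcal{S} \times \mathcal{A} : \mathrm{gap}_h(x,a) \le C(H+1)\,r\},
\]
\[
z_h = \inf\{ d > 0 : N_r(Z_h^r) \le c\, r^{-d} \text{ for all } r \in (0,1] \},
\]

where N_r(·) denotes the r-packing number and c is the constant in the packing-number bound.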

Background

The paper introduces the zooming dimension as an instance-dependent complexity measure for continuous-state, continuous-action reinforcement learning, defined through packing numbers of near-optimal state–action pairs. This measure yields instance-dependent regret bounds that can be significantly better than those based on the ambient dimension.
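
As a concrete illustration of the packing-number definition, the following self-contained Python sketch estimates the scaling of N_r(Z^r) for a hypothetical one-dimensional-state, one-dimensional-action instance. The gap function, the optimal-action curve, the constant C, and the grid resolution are all invented for illustration and do not come from the paper.

# Hypothetical toy instance (not from the paper): estimate the packing-number
# scaling of the near-optimal set Z^r = {(x, a) : gap(x, a) <= C * r} for a
# 1-D state, 1-D action problem, using a greedy r-packing on a finite grid.
import numpy as np

def gap(x, a):
    # Assumed gap function: the optimal action is a_star(x) = 0.5 + 0.3*sin(4x),
    # and the gap grows linearly in |a - a_star(x)| (a Lipschitz Q_star).
    return np.abs(a - (0.5 + 0.3 * np.sin(4.0 * x)))

def greedy_packing_number(points, r):
    # Greedy r-packing in the sup norm: keep a point only if it is more than
    # r away from every point kept so far. Returns a lower bound on N_r.
    packed = []
    for p in points:
        if all(np.max(np.abs(p - q)) > r for q in packed):
            packed.append(p)
    return len(packed)

C = 1.0
grid = np.linspace(0.0, 1.0, 200)
X, A = np.meshgrid(grid, grid, indexing="ij")
pts = np.stack([X.ravel(), A.ravel()], axis=1)

for r in [0.2, 0.1, 0.05, 0.025]:
    near_opt = pts[gap(pts[:, 0], pts[:, 1]) <= C * r]
    n = greedy_packing_number(near_opt, r)
    # The near-optimal set is an O(r)-band around the curve a_star(x), so N_r
    # scales like 1/r and the estimated exponent tends to 1, not the ambient 2.
    print(f"r={r:.3f}  N_r={n}  log N_r / log(1/r) = {np.log(n) / np.log(1.0 / r):.2f}")

Because the near-optimal set here is an O(r)-band around a one-dimensional curve, the estimated exponent approaches 1 rather than the ambient dimension 2, which is exactly the kind of instance-dependent gain the zooming dimension is meant to capture.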

However, the current definition necessarily scales with at least the dimension of the state space and does not reflect the distribution of states actually visited by the optimal policy. The authors note that, empirically, adaptive discretization primarily refines regions the optimal policy visits, but they lack a formal way to incorporate this intuition into the zooming dimension.

A refined notion that integrates state-visitation frequencies could yield sharper, instance-specific guarantees by capturing reductions in the effective complexity of the state space beyond what the present zooming dimension permits.
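
One hypothetical candidate, offered purely as an illustration and not as a definition proposed in the paper, restricts the near-optimal set to a neighborhood of the support of the optimal policy's state-visitation distribution ρ_h^\star at step h before taking packing numbers:

\[
\widetilde{Z}_h^r = \bigl\{ (x,a) \in Z_h^r : \mathrm{dist}\bigl(x, \operatorname{supp}(\rho_h^\star)\bigr) \le r \bigr\},
\qquad
\tilde{z}_h = \inf\bigl\{ d > 0 : N_r(\widetilde{Z}_h^r) \le c\, r^{-d} \text{ for all } r \in (0,1] \bigr\}.
\]

Any such candidate would still have to be validated against the regret analysis: the learner also visits states off the optimal trajectory, so the argument would need to control the discretization error incurred there, for instance through the gaps at those states or through how quickly the algorithm's visitation measure concentrates near supp(ρ_h^\star).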

References

Even in the simpler contextual multi-armed bandit model, the zooming dimension necessarily scales with the dimension of the context space, regardless of the support of the context distribution or the mass it places over the context space. While analytically we cannot show gains in the state-space dimension, we see empirically in the experiments section that the algorithms only cover the state space in regions the optimal policy visits, but it is unclear how to include this intuition formally in the definition. Revisiting new notions of "instance specific" complexity is an interesting direction for future work in both tabular and continuous RL.

Adaptive Discretization in Online Reinforcement Learning (Sinclair et al., 2021, arXiv:2110.15843), Subsection 2.4 (Zooming Dimension).