Achieving √K-regret for low V-type Bellman Eluder dimension

Establish whether there exists a reinforcement learning algorithm that achieves O(√K) cumulative regret over K episodes under the assumption that the V-type Bellman Eluder (BE) dimension of the value-function class is finite. In particular, determine whether the techniques that yield regret guarantees for low V-type Bellman rank can be adapted to the more general low V-type BE dimension setting to obtain √K-regret.
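For concreteness, the regret notion in question is the standard episodic one (a sketch of the usual definition, where s_1^k is the initial state of episode k and \pi^k the policy the algorithm executes in episode k):

$$\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Big( V^{*}(s_1^{k}) - V^{\pi^{k}}(s_1^{k}) \Big),$$

and the target is \mathrm{Regret}(K) = O(\sqrt{K}) with polynomial dependence on the horizon and the BE dimension. A PAC-style sample-complexity bound, by contrast, only guarantees that a near-optimal policy is output after polynomially many episodes, without controlling the reward accumulated along the way.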

Background

The paper introduces the Bellman Eluder (BE) dimension as a general complexity measure for reinforcement learning with function approximation, with Q-type and V-type variants. The authors provide both regret and sample-complexity guarantees for low Q-type BE dimension, but for the V-type case they obtain only PAC-style sample-complexity results, because their V-type algorithms require uniform random action selection during exploration.
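As a reminder of the object involved (a sketch following the paper's notation up to minor details, not a verbatim definition): writing \pi_f for the greedy policy induced by f \in \mathcal{F} and \mathcal{T}_h for the Bellman operator at step h, the V-type BE dimension is the distributional Eluder dimension of the class of state-wise Bellman residuals evaluated at the greedy action,

$$\dim_{\mathrm{VBE}}(\mathcal{F}, \varepsilon) \;=\; \max_{h \in [H]} \dim_{\mathrm{DE}}\Big( \big\{\, s \mapsto \big(f_h - \mathcal{T}_h f_{h+1}\big)\big(s, \pi_f(s)\big) \;:\; f \in \mathcal{F} \,\big\},\ \mathcal{D}_h,\ \varepsilon \Big),$$

where \mathcal{D}_h is a family of distributions over states. The key contrast with the Q-type variant is that the residuals are averaged over state distributions only, which is what forces the uniform action sampling in the paper's V-type algorithms.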

In contrast, prior work (Dong et al., 2020) obtained √T-regret for problems with low V-type Bellman rank. This raises the question of whether similar regret guarantees can be established for the broader class characterized by low V-type BE dimension.
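The sense in which the BE-dimension class is broader can be made precise: the paper shows that a Bellman rank factorization bounds the BE dimension up to logarithmic factors, schematically

$$\text{V-type Bellman rank} \le d \;\Longrightarrow\; \dim_{\mathrm{VBE}}(\mathcal{F}, \varepsilon) = \widetilde{O}(d),$$

so a \sqrt{T}-regret guarantee under low V-type BE dimension would in particular recover the guarantee of Dong et al. (2020) for low V-type Bellman rank.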

References

From the source paper: "Dong et al. (2020) propose an algorithm that can achieve √T-regret for problems of low V-type Bellman rank. It is an interesting open problem to study whether similar techniques can be adapted to the low V-type BE dimension setting so that we can also obtain √T-regret."

Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms (2102.00815 - Jin et al., 2021), Appendix A (V-type BE Dimension and Algorithms), after the theorem on V-type GOLF.