Achieving √K-regret for low V-type Bellman Eluder dimension
Establish whether there exists a reinforcement learning algorithm that achieves O(√K) cumulative regret over K episodes, assuming only that the V-type Bellman Eluder (BE) dimension of the value-function class is finite. Specifically, determine whether the techniques developed for low V-type Bellman rank can be adapted to the low V-type BE dimension setting to obtain √K-regret guarantees.
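For concreteness, the quantity in question can be written out as below. This is a sketch using the standard episodic definition of cumulative regret; the symbols \pi^k (the policy played in episode k), V_1^{\pi} (its value at the first step), V_1^{\star} (the optimal value) and s_1 (the initial state) are notational assumptions, not taken from the statement above.

```latex
% Cumulative regret over K episodes (standard episodic definition; notation assumed)
\[
  \mathrm{Reg}(K) \;=\; \sum_{k=1}^{K} \Bigl( V_1^{\star}(s_1) - V_1^{\pi^k}(s_1) \Bigr)
\]
% The open problem asks whether Reg(K) = O(\sqrt{K}) is achievable,
% up to factors depending on the V-type BE dimension and other problem parameters.
```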
References
Dong et al. (2020) propose an algorithm that can achieve \sqrt{T}-regret for problems of low V-type Bellman rank. It is an interesting open problem to study whether similar techniques can be adapted to the low V-type BE dimension setting so that we can also obtain \sqrt{T}-regret.
— Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms
(2102.00815 - Jin et al., 2021) in Appendix A (V-type BE Dimension and Algorithms), after the theorem on V-type Golf