Fast bandit last-iterate convergence without the A2L reduction (AOG-based dynamics)

Ascertain whether uncoupled learning dynamics constructed from existing algorithms such as Accelerated Optimistic Gradient (AOG), without relying on the A2L reduction, can be designed and analyzed to achieve fast last-iterate convergence rates under bandit (payoff-based) feedback in multi-player zero-sum polymatrix games.

Background

Cai et al. (2025) show that their A2L (average-iterate to last-iterate) reduction, combined with OMWU and a bandit utility-estimation procedure, achieves a high-probability Õ(T^{-1/5}) last-iterate convergence rate under bandit feedback. In contrast, prior work that adapted AOG with similar estimation ideas obtained only Õ(T^{-1/8}) rates, largely because potential-function-based analyses are sensitive to estimation error.
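
To make the gap between the two rates concrete, the following back-of-the-envelope calculation inverts each rate to get the number of rounds needed to reach a target accuracy ε. It is only an illustration: it ignores the constants and logarithmic factors hidden by Õ(·), and the choice ε = 0.1 is an arbitrary assumption, not a figure from the paper.

```latex
% Rounds needed to reach accuracy \varepsilon, ignoring constants and log factors.
\begin{align*}
  T^{-1/5} \le \varepsilon &\iff T \ge \varepsilon^{-5}, \\
  T^{-1/8} \le \varepsilon &\iff T \ge \varepsilon^{-8}.
\end{align*}
% Illustrative value \varepsilon = 0.1 (an assumption chosen for the example):
\[
  \varepsilon^{-5} = 10^{5}
  \qquad\text{versus}\qquad
  \varepsilon^{-8} = 10^{8}.
\]
```

Under these simplifications, the AOG-based rate would need roughly a thousand times more rounds than the A2L-based rate at this accuracy, and the ratio ε^{-3} grows as ε shrinks.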

Motivated by this gap, the authors pose whether one can obtain fast last-iterate convergence under bandit feedback using established algorithms like AOG, but without relying on the A2L reduction.

References

Nevertheless, it remains an interesting open question whether one can design uncoupled learning dynamics with fast convergence rates in the bandit feedback setting using existing algorithms like AOG without relying on the A2L reduction.

— From Average-Iterate to Last-Iterate Convergence in Games: A Reduction and Its Applications (2506.03464 - Cai et al., 4 Jun 2025) in Section 6 (Learning in Zero-Sum Games with Bandit Feedback) — Discussion
