Adaptive weighted algorithm achieving optimal dynamic regret without prior knowledge of path length in non-stationary linear bandits

Develop an adaptive weight-based algorithm for non-stationary linear bandits that attains the optimal dynamic regret rate without prior knowledge of the path length P_T, by tuning the discount factor γ_t adaptively in real time even though the learner observes only a single data pair (X_t, r_t) per round.

Background

The paper revisits weighted strategies for non-stationary parametric bandits and shows improved bounds for several models. For linear bandits, the proposed algorithm can be combined with the BOB strategy to handle unknown P_T but remains suboptimal compared to adaptive restart-based methods. The authors note that achieving optimal dynamic regret without prior knowledge of P_T would require an adaptive weight-based scheme that adjusts γ_t online, which is challenging because the learner receives only one data pair per round and γ_t ranges continuously in [0,1], unlike binary restart decisions.
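For concreteness, the weighted strategy underlying this question maintains a discounted regularized least-squares estimate, downweighting old observations by a factor γ each round. The following is a minimal sketch under common assumptions (ridge regularization λ, a fixed γ; all class and variable names are illustrative, not the paper's):

```python
import numpy as np

class DiscountedLinearEstimator:
    """Sketch of a discounted (weighted) ridge estimator for linear bandits.

    Maintains V_t = sum_s gamma^{t-s} x_s x_s^T + lambda I and
    b_t = sum_s gamma^{t-s} r_s x_s via recursive updates, using only the
    single pair (X_t, r_t) observed each round.
    """

    def __init__(self, dim, gamma=0.95, reg=1.0):
        self.gamma = gamma            # discount factor in (0, 1]
        self.reg = reg                # ridge parameter lambda
        self.V = reg * np.eye(dim)    # weighted Gram matrix
        self.b = np.zeros(dim)        # weighted reward-feature vector

    def update(self, x, r):
        # Discount past statistics, then add the new pair; the extra
        # (1 - gamma) * lambda * I term keeps the regularizer at lambda I.
        d = len(x)
        self.V = self.gamma * self.V + np.outer(x, x) + (1 - self.gamma) * self.reg * np.eye(d)
        self.b = self.gamma * self.b + r * x

    def theta_hat(self):
        # Weighted ridge estimate of the unknown parameter theta.
        return np.linalg.solve(self.V, self.b)
```

The open question is precisely about choosing `gamma` adaptively per round (γ_t) rather than fixing it using prior knowledge of P_T.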

This open question amounts to closing the gap between the practical advantages of weighted strategies in gradually drifting environments and the theoretical optimality of adaptive restart strategies such as MASTER, without relying on prior knowledge of the degree of non-stationarity.
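By contrast, the BOB strategy mentioned above sidesteps the unknown P_T by running an adversarial bandit over a finite grid of candidate discount factors; the difficulty highlighted in the open question is that γ_t is continuous in [0,1], so no finite grid covers it exactly. A hedged sketch of such a meta-layer, using a standard Exp3 update (grid, learning rate, and reward scaling are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

gammas = [0.90, 0.95, 0.99, 1.0]  # candidate discount factors (illustrative grid)
K = len(gammas)
eta = 0.5                         # Exp3 learning rate (assumption)
weights = np.ones(K)              # Exp3 weights over the grid

def choose_gamma():
    """Sample a candidate gamma from the Exp3 distribution."""
    p = weights / weights.sum()
    k = rng.choice(K, p=p)
    return k, gammas[k], p[k]

def feed_back(k, p_k, block_reward):
    """Exp3 update for the chosen arm; block_reward assumed rescaled to [0, 1].

    The importance weight 1 / p_k makes the reward estimate unbiased.
    """
    weights[k] *= np.exp(eta * block_reward / (K * p_k))
```

Each block, the chosen `gamma` would parameterize a weighted base learner and the block's realized reward would be fed back. This only approximates the continuum of γ values, which is why it remains suboptimal relative to adaptive restart methods.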

References

However, this bound is not optimal, and it is possible to design an adaptive weight-based algorithm based on our result, in the spirit of [citation], to further achieve an optimal dynamic regret without prior knowledge of $P_T$. We leave this as an important open question for future study.

Revisiting Weighted Strategy for Non-stationary Parametric Bandits and MDPs  (2601.01069 - Wang et al., 3 Jan 2026) in Section 3.2 (Algorithm and Regret Guarantee), after Theorem 1