Adaptive weighted algorithm achieving optimal dynamic regret without prior knowledge of path length in non-stationary linear bandits
Develop a weight-based algorithm for non-stationary linear bandits that attains the optimal dynamic regret rate without prior knowledge of the path length P_T. Such an algorithm would have to tune the discount factor γ_t online, even though it observes only a single data pair (X_t, r_t) per round.
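To make the role of the discount factor concrete, the following is a minimal sketch of a discounted (exponentially weighted) ridge estimator, the kind of building block that weight-based methods for linear bandits typically rely on. It is not the algorithm of the cited paper and does not address the open problem: the class name `DiscountedRidge`, the regularizer `lam`, the noise level, and the toy abrupt-change environment are all assumptions made for illustration. The optional per-round `gamma` argument only marks where an adaptive choice of γ_t would plug in; how to choose it is exactly what remains open.

```python
import numpy as np


class DiscountedRidge:
    """Hypothetical helper: ridge regression with exponential forgetting."""

    def __init__(self, dim, gamma, lam=1.0):
        self.gamma = gamma            # default discount factor in (0, 1]
        self.lam = lam                # ridge regularization (assumed value)
        self.V = lam * np.eye(dim)    # weighted Gram matrix plus lam * I
        self.b = np.zeros(dim)        # weighted sum of r_s * X_s

    def update(self, x, r, gamma=None):
        # A per-round gamma can be passed in; choosing it adaptively
        # without knowing P_T is the open question, not solved here.
        g = self.gamma if gamma is None else gamma
        # Discount past data while keeping the regularizer fixed at lam * I.
        self.V = g * self.V + np.outer(x, x) + (1.0 - g) * self.lam * np.eye(len(x))
        self.b = g * self.b + r * x

    def estimate(self):
        # Solve V theta = b rather than forming an explicit inverse.
        return np.linalg.solve(self.V, self.b)


def final_errors(seed=0, rounds=400, gamma=0.95):
    """Toy environment (assumption): the parameter flips sign halfway through."""
    rng = np.random.default_rng(seed)
    plain = DiscountedRidge(dim=2, gamma=1.0)      # no forgetting
    disc = DiscountedRidge(dim=2, gamma=gamma)     # fixed discount factor
    theta = np.array([1.0, 0.0])
    for t in range(rounds):
        if t == rounds // 2:
            theta = -theta                         # abrupt change in the parameter
        x = rng.standard_normal(2)                 # one feature vector per round
        r = float(x @ theta) + 0.1 * rng.standard_normal()  # one noisy reward
        plain.update(x, r)
        disc.update(x, r)
    err_plain = float(np.linalg.norm(plain.estimate() - theta))
    err_disc = float(np.linalg.norm(disc.estimate() - theta))
    return err_plain, err_disc


if __name__ == "__main__":
    print(final_errors())
```

In this toy run the estimator without forgetting averages over both regimes, while the discounted one tracks the current parameter. A good fixed γ depends on how much the parameter moves, which is why the open problem asks for a way to set γ_t without knowing P_T.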
References
However, this bound is not optimal, and it is possible to design an adaptive weight-based algorithm based on our result, in the spirit of, to further achieve an optimal dynamic regret without prior knowledge of $P_T$. We leave this as an important open question for future study.
— Revisiting Weighted Strategy for Non-stationary Parametric Bandits and MDPs
(2601.01069 - Wang et al., 3 Jan 2026) in Section 3.2 (Algorithm and Regret Guarantee), after Theorem 1