
State-Separated SARSA: A Practical Sequential Decision-Making Algorithm with Recovering Rewards (2403.11520v1)

Published 18 Mar 2024 in cs.LG and stat.ML

Abstract: While many multi-armed bandit algorithms assume that rewards for all arms are constant across rounds, this assumption does not hold in many real-world scenarios. This paper considers the setting of recovering bandits (Pike-Burke & Grunewalder, 2019), where the reward depends on the number of rounds elapsed since the last time an arm was pulled. We propose a new reinforcement learning (RL) algorithm tailored to this setting, named the State-Separated SARSA (SS-SARSA) algorithm, which treats rounds as states. The SS-SARSA algorithm achieves efficient learning by reducing the number of state combinations required for Q-learning/SARSA, which often suffer from combinatorial issues in large-scale RL problems. Additionally, it makes minimal assumptions about the reward structure and offers lower computational complexity. Furthermore, we prove asymptotic convergence to an optimal policy under mild assumptions. Simulation studies demonstrate the superior performance of our algorithm across various settings.
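
The core idea in the abstract is that, instead of learning one Q-function over the joint vector of elapsed times for all arms (which grows combinatorially with the number of arms), the learner keeps a separate, small Q-table per arm indexed only by that arm's elapsed time since its last pull. The sketch below illustrates this state-separation idea under simplifying assumptions: a capped elapsed-time counter per arm, an epsilon-greedy policy over per-arm Q-values, and a one-step SARSA-style update. All names (RecoveringBanditEnv, SSSarsaAgent, max_wait, the recovery functions) and the exploration schedule are illustrative choices, not the paper's exact specification.

```python
import numpy as np


class RecoveringBanditEnv:
    """Toy recovering bandit: each arm's mean reward is a function of the
    number of rounds since that arm was last pulled (its "wait")."""

    def __init__(self, recovery_fns, max_wait=10, noise=0.1, seed=0):
        self.recovery_fns = recovery_fns          # one mean-reward function per arm
        self.max_wait = max_wait                  # waits are capped at this value
        self.noise = noise
        self.rng = np.random.default_rng(seed)
        self.waits = np.full(len(recovery_fns), max_wait)

    def step(self, arm):
        mean = self.recovery_fns[arm](self.waits[arm])
        reward = mean + self.noise * self.rng.standard_normal()
        # reset the pulled arm's counter, then advance every counter by one round (capped)
        self.waits[arm] = 0
        self.waits = np.minimum(self.waits + 1, self.max_wait)
        return reward


class SSSarsaAgent:
    """State-separated learner: one small Q-table per arm, indexed only by
    that arm's capped wait, instead of one table over the joint wait vector."""

    def __init__(self, n_arms, max_wait=10, alpha=0.1, gamma=0.99, eps=0.1, seed=1):
        self.q = np.zeros((n_arms, max_wait + 1))  # Q[arm, wait]
        self.alpha, self.gamma, self.eps = alpha, gamma, eps
        self.n_arms = n_arms
        self.rng = np.random.default_rng(seed)

    def act(self, waits):
        if self.rng.random() < self.eps:
            return int(self.rng.integers(self.n_arms))
        # score each arm by its own Q-value at that arm's current wait
        scores = self.q[np.arange(self.n_arms), waits]
        return int(np.argmax(scores))

    def update(self, arm, wait, reward, next_waits, next_arm):
        # one-step SARSA target built from the next on-policy action's Q-value
        target = reward + self.gamma * self.q[next_arm, next_waits[next_arm]]
        self.q[arm, wait] += self.alpha * (target - self.q[arm, wait])


if __name__ == "__main__":
    env = RecoveringBanditEnv(
        recovery_fns=[
            lambda w: 1.0 - np.exp(-0.5 * w),          # recovers quickly
            lambda w: 0.8 * (1.0 - np.exp(-0.1 * w)),  # recovers slowly, plateaus lower
        ],
        max_wait=10,
    )
    agent = SSSarsaAgent(n_arms=2, max_wait=10)

    waits = env.waits.copy()
    arm = agent.act(waits)
    total = 0.0
    for _ in range(10_000):
        reward = env.step(arm)
        next_waits = env.waits.copy()
        next_arm = agent.act(next_waits)
        agent.update(arm, waits[arm], reward, next_waits, next_arm)
        waits, arm = next_waits, next_arm
        total += reward
    print(f"average reward over 10,000 rounds: {total / 10_000:.3f}")
```

With the two example recovery curves above, the per-arm tables let the learner discover how long to wait before pulling each arm again, without ever enumerating the joint wait space; the paper's actual algorithm and its convergence guarantees may differ in the details of the update and exploration.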

References (31)
  1. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47(2):235–256.
  2. Stochastic multi-armed-bandit problem with non-stationary rewards. Advances in Neural Information Processing Systems, 27.
  3. Survey on applications of multi-armed and contextual bandits. In 2020 IEEE Congress on Evolutionary Computation (CEC), pages 1–8. IEEE.
  4. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24.
  5. Survey of multiarmed bandit algorithms applied to recommendation systems. 9(4):16.
  6. The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359–376. JMLR Workshop and Conference Proceedings. ISSN: 1938-7228.
  7. On upper-confidence bound policies for non-stationary bandit problems. In Algorithmic Learning Theory, pages 174–188.
  8. Tight policy regret bounds for improving and decaying bandits. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1562–1570.
  9. Recharging bandits. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 309–319. IEEE.
  10. A Last Switch Dependent Analysis of Satiation and Seasonality in Bandits. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, pages 971–990. PMLR. ISSN: 2640-3498.
  11. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22.
  12. Bandit Algorithms. Cambridge University Press, 1 edition.
  13. Rebounding Bandits for Modeling Satiation Effects. arXiv:2011.06741 [cs, stat].
  14. Rotting bandits. Advances in Neural Information Processing Systems, 30.
  15. Efficient automatic CASH via rising bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4763–4771.
  16. A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
  17. Stochastic rising bandits. In International Conference on Machine Learning, pages 15421–15457. PMLR.
  18. Dynamic Online Pricing with Incomplete Information Using Multiarmed Bandit Experiments. Marketing Science, 38(2):226–252.
  19. Fatigue-aware ad creative selection. arXiv preprint arXiv:1908.08936.
  20. Recovering Bandits. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  21. Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
  22. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering, Cambridge, UK.
  23. Weighted linear bandits for non-stationary environments. Advances in Neural Information Processing Systems, 32.
  24. Rotting bandits are no harder than stochastic ones. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, pages 2564–2572. PMLR. ISSN: 2640-3498.
  25. Dynamic Planning and Learning under Recovering Rewards. In Proceedings of the 38th International Conference on Machine Learning, pages 9702–9711. PMLR. ISSN: 2640-3498.
  26. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38:287–308.
  27. Reinforcement learning: An introduction. MIT press.
  28. Thompson, W. R. (1933). On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 25(3/4):285–294. Publisher: [Oxford University Press, Biometrika Trust].
  29. Fighting Boredom in Recommender Systems with Linear Reinforcement Learning. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.
  30. Q-learning. Machine Learning, 8:279–292.
  31. A Sleeping, Recovering Bandit Algorithm for Optimizing Recurring Notifications. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3008–3016, Virtual Event, CA, USA. ACM.
