
Influencing Bandits: Arm Selection for Preference Shaping (2403.00036v1)

Published 29 Feb 2024 in cs.LG, cs.AI, cs.IR, cs.SY, and eess.SY

Abstract: We consider a non-stationary multi-armed bandit in which the population preferences are positively and negatively reinforced by the observed rewards. The objective of the algorithm is to shape the population preferences to maximize the fraction of the population favouring a predetermined arm. For the case of binary opinions, two types of opinion dynamics are considered -- decreasing elasticity (modeled as a Polya urn with an increasing number of balls) and constant elasticity (using the voter model). For the first case, we describe an Explore-then-commit policy and a Thompson sampling policy and analyse the regret for each of these policies. We then show that these algorithms and their analyses carry over to the constant-elasticity case. We also describe a Thompson sampling-based algorithm for the case when more than two types of opinions are present. Finally, we discuss the case where the presence of multiple recommendation systems gives rise to a trade-off between their popularity and opinion-shaping objectives.
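
The abstract gives the setting but not the algorithmic details, so the following is only a minimal illustrative sketch of the interaction loop it describes, under assumed dynamics: binary opinions tracked as ball counts in a Polya urn (so each added ball moves the preference fraction less as the urn grows), hypothetical Bernoulli reward means `p`, and a plain Thompson sampling selector standing in for the paper's Explore-then-commit and shaping policies. The reinforcement rule, parameters, and variable names are assumptions for illustration, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters (not from the paper), for illustration only.
T = 5000                      # horizon
p = np.array([0.7, 0.4])      # assumed Bernoulli reward means of arms 0 and 1
target_arm = 0                # predetermined arm whose popularity we track

# Binary-opinion population state as a Polya urn: counts[i] is the number of
# "balls" (population weight) favouring arm i. The urn only grows, so each
# added ball shifts the preference fraction less over time (decreasing elasticity).
counts = np.array([50.0, 50.0])

# Beta(1, 1) posteriors for plain Thompson sampling over the reward means.
alpha = np.ones(2)
beta = np.ones(2)

for t in range(T):
    # Thompson sampling stand-in for the paper's arm-selection policies:
    # sample a mean for each arm and play the argmax.
    theta = rng.beta(alpha, beta)
    arm = int(np.argmax(theta))

    # Observe a Bernoulli reward for the played arm and update its posterior.
    reward = int(rng.random() < p[arm])
    alpha[arm] += reward
    beta[arm] += 1 - reward

    # Assumed reinforcement rule: a good reward adds a ball of the played
    # arm's colour (positive reinforcement); a bad reward adds a ball of the
    # opposing colour (negative reinforcement).
    if reward:
        counts[arm] += 1
    else:
        counts[1 - arm] += 1

print(f"fraction favouring arm {target_arm}: {counts[target_arm] / counts.sum():.3f}")
```

In this toy loop the policy only maximizes its own reward; a shaping policy of the kind the abstract describes would instead bias arm selection toward reinforcing the predetermined arm, trading immediate reward for a larger long-run preference fraction.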

