Adversarial Bandits with Multi-User Delayed Feedback: Theory and Application (2310.11188v2)

Published 17 Oct 2023 in cs.LG

Abstract: Multi-armed bandit (MAB) models have attracted significant research attention due to their applicability and effectiveness in real-world scenarios such as resource allocation, online advertising, and dynamic pricing. As an important branch, adversarial MAB problems with delayed feedback have recently been proposed and studied, in which a conceptual adversary strategically selects the reward distributions associated with each arm to challenge the learning algorithm, and the agent experiences a delay between taking an action and receiving the corresponding reward feedback. However, existing models restrict the feedback to be generated by a single user, which makes them inapplicable to prevalent multi-user scenarios (e.g., ad recommendation for a group of users). In this paper, we consider delayed feedback generated by multiple users with no restriction on its internal distribution; moreover, the feedback delays are arbitrary and unknown to the player in advance, and no latent correlation is assumed among the delays of different users within a round. Accordingly, we formulate an adversarial MAB problem with multi-user delayed feedback and design a modified EXP3 algorithm, MUD-EXP3, which makes a decision at each round based on importance-weighted estimators of the feedback received from different users. Assuming a known terminal round index $T$, number of users $M$, number of arms $N$, and delay upper bound $d_{max}$, we prove a regret of $\mathcal{O}(\sqrt{TM^2\ln{N}(N\mathrm{e}+4d_{max})})$. Furthermore, for the more common case of unknown $T$, we propose an adaptive algorithm, AMUD-EXP3, which attains sublinear regret with respect to $T$. Finally, extensive experiments demonstrate the correctness and effectiveness of our algorithms.
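
To make the algorithmic idea concrete, below is a minimal Python sketch of an EXP3-style learner handling multi-user delayed feedback, based only on the high-level description in the abstract: sample from an exponential-weights distribution each round, then apply importance-weighted updates whenever delayed feedback arrives. The class name, interface, and learning-rate tuning are illustrative assumptions, not the paper's exact specification.

```python
import math
import random

class MUDEXP3Sketch:
    """EXP3-style learner for multi-user delayed feedback.

    Illustrative sketch based on the abstract only: the paper's exact
    update rule, exploration mixing, and learning-rate tuning may differ.
    """

    def __init__(self, n_arms, n_users, horizon, d_max):
        self.n_arms = n_arms
        self.n_users = n_users
        # Placeholder learning rate suggested by the shape of the stated
        # regret bound O(sqrt(T M^2 ln N (N e + 4 d_max))); an assumption.
        self.eta = math.sqrt(
            math.log(n_arms)
            / (horizon * n_users**2 * (n_arms * math.e + 4 * d_max))
        )
        self.log_weights = [0.0] * n_arms
        self.prob_history = {}  # round index -> sampling distribution

    def _distribution(self):
        # Softmax over log-weights, computed stably.
        m = max(self.log_weights)
        w = [math.exp(lw - m) for lw in self.log_weights]
        total = sum(w)
        return [wi / total for wi in w]

    def select_arm(self, t):
        # Sample an arm for round t and remember the distribution so
        # that late-arriving feedback can be importance-weighted.
        p = self._distribution()
        self.prob_history[t] = p
        r, acc = random.random(), 0.0
        for arm, pa in enumerate(p):
            acc += pa
            if r < acc:
                return arm
        return self.n_arms - 1

    def observe(self, t, arm, rewards):
        # Apply feedback generated at round t, possibly received many
        # rounds later. `rewards` holds one value in [0, 1] for each
        # user whose feedback for round t has arrived; since delays may
        # differ across users, this can be called more than once per
        # round with different subsets of users.
        p_t = self.prob_history[t][arm]
        estimate = sum(rewards) / (self.n_users * p_t)  # importance weighting
        self.log_weights[arm] += self.eta * estimate
```

In use, `select_arm(t)` would be called once per round, while `observe(t, arm, rewards)` is invoked whenever a batch of user feedback for an earlier round arrives, regardless of how long it was delayed.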
